﻿WEBVTT

00:00:06.804 --> 00:00:10.610
[students murmuring]

00:00:10.610 --> 00:00:15.375
- Okay, so good afternoon
everyone, let's get started.

00:00:15.375 --> 00:00:18.257
So hi, so for those of
you who I haven't met yet,

00:00:18.257 --> 00:00:21.470
my name is Serena Yeung and I'm the third

00:00:21.470 --> 00:00:24.853
and final instructor for this class,

00:00:24.853 --> 00:00:28.307
and I'm also a PhD student
in Fei-Fei's group.

00:00:28.307 --> 00:00:31.292
Okay, so today we're going
to talk about backpropagation

00:00:31.292 --> 00:00:33.843
and neural networks, and so
now we're really starting

00:00:33.843 --> 00:00:37.346
to get to some of the core
material in this class.

00:00:37.346 --> 00:00:39.929
Before we begin, let's see, oh.

00:00:42.124 --> 00:00:44.166
So a few administrative details,

00:00:44.166 --> 00:00:47.602
so assignment one is due
Thursday, April 20th,

00:00:47.602 --> 00:00:51.522
so a reminder, we shifted
the date back by a little bit

00:00:51.522 --> 00:00:55.105
and it's going to be due
11:59 p.m. on Canvas.

00:00:56.722 --> 00:00:59.316
So you should start thinking
about your projects,

00:00:59.316 --> 00:01:02.327
there are TA specialties
listed on the Piazza website

00:01:02.327 --> 00:01:05.432
so if you have questions
about a specific project topic

00:01:05.432 --> 00:01:08.958
you're thinking about, you
can go and try and find

00:01:08.958 --> 00:01:12.682
the TAs that might be most relevant.

00:01:12.682 --> 00:01:15.910
And then also for Google Cloud,
so all students are going

00:01:15.910 --> 00:01:19.130
to get $100 in credits
to use for Google Cloud

00:01:19.130 --> 00:01:21.268
for their assignments and project,

00:01:21.268 --> 00:01:22.630
so you should be receiving an email

00:01:22.630 --> 00:01:24.285
for that this week, I think.

00:01:24.285 --> 00:01:25.666
A lot of you may have already, and then

00:01:25.666 --> 00:01:28.227
for those of you who haven't,
they're going to come,

00:01:28.227 --> 00:01:31.060
should be by the end of this week.

00:01:32.544 --> 00:01:35.721
Okay so where we are, so
far we've talked about

00:01:35.721 --> 00:01:39.207
how to define a classifier
using a function f,

00:01:39.207 --> 00:01:41.669
parameterized by weights
W, and this function f

00:01:41.669 --> 00:01:45.836
is going to take data x as input,
and output a vector of scores

00:01:48.027 --> 00:01:51.432
for each of the classes
that you want to classify.

00:01:51.432 --> 00:01:54.778
And so from here we can also define

00:01:54.778 --> 00:01:57.314
a loss function, so for
example, the SVM loss function

00:01:57.314 --> 00:02:00.445
that we've talked about
which basically quantifies

00:02:00.445 --> 00:02:02.053
how happy or unhappy we are with

00:02:02.053 --> 00:02:04.167
the scores that we've produced, right,

00:02:04.167 --> 00:02:07.727
and then we can use that to
define a total loss term.

00:02:07.727 --> 00:02:10.757
So L here, which is a
combination of this data term,

00:02:10.757 --> 00:02:14.617
combined with a regularization
term that expresses

00:02:14.617 --> 00:02:16.601
how simple our model is,
and we have a preference

00:02:16.601 --> 00:02:20.201
for simpler models, for
better generalization.

00:02:20.201 --> 00:02:23.295
And so now we want to
find the parameters W

00:02:23.295 --> 00:02:25.209
that correspond to our lowest loss, right?

00:02:25.209 --> 00:02:27.092
We want to minimize the loss function,

00:02:27.092 --> 00:02:28.497
and so to do that we want to find

00:02:28.497 --> 00:02:31.497
the gradient of L with respect to W.

00:02:33.275 --> 00:02:35.829
So last lecture we talked
about how we can do this

00:02:35.829 --> 00:02:37.735
using optimization, and we're going

00:02:37.735 --> 00:02:39.976
to iteratively take steps in the direction

00:02:39.976 --> 00:02:42.817
of steepest descent, which is
the negative of the gradient,

00:02:42.817 --> 00:02:44.819
in order to walk down this loss landscape

00:02:44.819 --> 00:02:47.291
and get to the point
of lowest loss, right?

00:02:47.291 --> 00:02:51.947
And we saw how this gradient
descent can basically take

00:02:51.947 --> 00:02:54.770
this trajectory, looking
like this image on the right,

00:02:54.770 --> 00:02:59.293
getting to the bottom
of your loss landscape.

00:02:59.293 --> 00:03:00.126
Oh!

00:03:02.247 --> 00:03:04.376
Okay, and so we also
talked about different ways

00:03:04.376 --> 00:03:06.154
for computing a gradient, right?

00:03:06.154 --> 00:03:08.538
We can compute this numerically

00:03:08.538 --> 00:03:10.696
using finite difference approximation

00:03:10.696 --> 00:03:13.444
which is slow and approximate,
but at the same time

00:03:13.444 --> 00:03:14.725
it's really easy to write out,

00:03:14.725 --> 00:03:17.589
you know you can always
get the gradient this way.

00:03:17.589 --> 00:03:21.064
We also talked about how to
use the analytic gradient

00:03:21.064 --> 00:03:23.113
and computing this is, it's fast

00:03:23.113 --> 00:03:25.227
and exact once you've
gotten the expression for

00:03:25.227 --> 00:03:27.499
the analytic gradient, but
at the same time you have

00:03:27.499 --> 00:03:29.670
to do all the math and the
calculus to derive this,

00:03:29.670 --> 00:03:32.734
so it's also, you know, easy
to make mistakes, right?

00:03:32.734 --> 00:03:34.925
So in practice what we want
to do is we want to derive

00:03:34.925 --> 00:03:37.620
the analytic gradient and use this,

00:03:37.620 --> 00:03:39.837
but at the same time check
our implementation using

00:03:39.837 --> 00:03:41.267
the numerical gradient to make sure

00:03:41.267 --> 00:03:44.600
that we've gotten all of our math right.
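
NOTE A minimal sketch (not the course's assignment code) of the gradient check described here: compare a hand-derived analytic gradient against a slow, centered finite-difference estimate. The toy loss f and step size h are illustrative choices.
    import numpy as np
    def f(w):
        return np.sum(w ** 2)          # toy scalar loss
    def analytic_grad(w):
        return 2.0 * w                 # hand-derived: df/dw = 2w
    def numerical_grad(f, w, h=1e-5):
        grad = np.zeros_like(w)        # slow, approximate finite differences
        for i in range(w.size):
            old = w[i]
            w[i] = old + h; fp = f(w)
            w[i] = old - h; fm = f(w)
            w[i] = old
            grad[i] = (fp - fm) / (2 * h)
        return grad
    w = np.random.randn(5)
    print(np.allclose(analytic_grad(w), numerical_grad(f, w)))  # expect True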

00:03:46.032 --> 00:03:48.518
So today we're going to
talk about how to compute

00:03:48.518 --> 00:03:52.261
the analytic gradient for
arbitrarily complex functions,

00:03:52.261 --> 00:03:55.824
using a framework that I'm going
to call computational graphs.

00:03:55.824 --> 00:03:58.506
And so basically what a
computational graph is,

00:03:58.506 --> 00:04:00.555
is that we can use this
kind of graph in order

00:04:00.555 --> 00:04:04.010
to represent any function,
where the nodes of the graph

00:04:04.010 --> 00:04:06.611
are steps of computation
that we go through.

00:04:06.611 --> 00:04:07.952
So for example, in this example,

00:04:07.952 --> 00:04:11.306
the linear classifier
that we've talked about,

00:04:11.306 --> 00:04:14.737
the inputs here are x and W, right,

00:04:14.737 --> 00:04:18.180
and then this multiplication
node represents

00:04:18.180 --> 00:04:21.321
the matrix multiplication of

00:04:21.322 --> 00:04:25.038
the parameters W with
our data x that we have,

00:04:25.038 --> 00:04:27.478
outputting our vector of scores.

00:04:27.478 --> 00:04:29.469
And then we have another
computational node

00:04:29.469 --> 00:04:31.948
which represents our hinge loss, right,

00:04:31.948 --> 00:04:35.113
computing our data loss term, Li.

00:04:35.113 --> 00:04:38.198
And we also have this
regularization term at

00:04:38.198 --> 00:04:39.340
the bottom right, so this node

00:04:39.340 --> 00:04:42.964
which computes our regularization term,

00:04:42.964 --> 00:04:44.877
and then our total loss
here at the end, L,

00:04:44.877 --> 00:04:49.044
is the sum of the regularization
term and the data term.
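
NOTE One common way to write the total loss being described here, as in the earlier loss-function lectures (data term averaged over examples plus a weighted regularization term):
    L = (1/N) * sum_i L_i(f(x_i, W), y_i) + lambda * R(W)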

00:04:50.624 --> 00:04:52.325
And the advantage is
that once we can express

00:04:52.325 --> 00:04:54.675
a function using a computational graph,

00:04:54.675 --> 00:04:58.103
then we can use a technique
that we call backpropagation

00:04:58.103 --> 00:05:00.261
which is going to recursively
use the chain rule

00:05:00.261 --> 00:05:02.227
in order to compute the gradient

00:05:02.227 --> 00:05:05.568
with respect to every variable
in the computational graph,

00:05:05.568 --> 00:05:09.830
and so we're going to
see how this is done.

00:05:09.830 --> 00:05:11.675
And this becomes very
useful when we start working

00:05:11.675 --> 00:05:13.413
with really complex functions,

00:05:13.413 --> 00:05:15.502
so for example,
convolutional neural networks

00:05:15.502 --> 00:05:17.836
that we're going to talk
about later in this class.

00:05:17.836 --> 00:05:20.900
We have here the input image at the top,

00:05:20.900 --> 00:05:22.364
we have our loss at the bottom,

00:05:22.364 --> 00:05:24.023
and the input has to
go through many layers

00:05:24.023 --> 00:05:26.269
of transformations in order to get all

00:05:26.269 --> 00:05:29.102
the way down to the loss function.

00:05:30.896 --> 00:05:33.165
And this can get even
crazier with things like,

00:05:33.165 --> 00:05:35.447
the, you know, like a
Neural Turing Machine,

00:05:35.447 --> 00:05:37.869
which is another kind
of deep learning model,

00:05:37.869 --> 00:05:39.915
and in this case you can see
that the computational graph

00:05:39.915 --> 00:05:43.413
for this is really insane, and especially,

00:05:43.413 --> 00:05:45.897
we end up, you know,
unrolling this over time.

00:05:45.897 --> 00:05:47.562
It's basically completely impractical

00:05:47.562 --> 00:05:49.901
if you want to compute the gradients

00:05:49.901 --> 00:05:53.234
for any of these intermediate variables.

00:05:55.560 --> 00:05:59.139
Okay, so how does backpropagation work?

00:05:59.139 --> 00:06:01.838
So we're going to start
off with a simple example,

00:06:01.838 --> 00:06:03.672
where again, our goal is
that we have a function.

00:06:03.672 --> 00:06:06.228
So in this case, f of x, y, z

00:06:06.228 --> 00:06:08.614
equals x plus y times z,

00:06:08.614 --> 00:06:10.746
and we want to find the
gradients of the output of

00:06:10.746 --> 00:06:13.572
the function with respect
to any of the variables.

00:06:13.572 --> 00:06:15.365
So the first step, always, is we want

00:06:15.365 --> 00:06:17.329
to take our function f, and we want

00:06:17.329 --> 00:06:20.623
to represent it using
a computational graph.

00:06:20.623 --> 00:06:23.339
Right, so here our computational
graph is on the right,

00:06:23.339 --> 00:06:25.957
and you can see that we have our,

00:06:25.957 --> 00:06:27.892
first we have the plus node, so x plus y,

00:06:27.892 --> 00:06:30.044
and then we have this
multiplication node, right,

00:06:30.044 --> 00:06:34.060
for the second computation
that we're doing.

00:06:34.060 --> 00:06:36.201
And then, now we're going
to do a forward pass

00:06:36.201 --> 00:06:38.732
of this network, so given the values of

00:06:38.732 --> 00:06:40.782
the variables that we have, so here,

00:06:40.782 --> 00:06:42.944
x equals negative two, y equals five

00:06:42.944 --> 00:06:46.251
and z equals negative four,
I'm going to fill these all in

00:06:46.251 --> 00:06:49.417
in our computational graph,
and then here we can compute

00:06:49.417 --> 00:06:52.965
an intermediate value,
so x plus y gives three,

00:06:52.965 --> 00:06:55.045
and then finally we pass it through again,

00:06:55.045 --> 00:06:56.990
through the last node, the multiplication,

00:06:56.990 --> 00:07:00.823
to get our final node
of f equals negative 12.

00:07:04.310 --> 00:07:08.784
So here we want to give every
intermediate variable a name.

00:07:08.784 --> 00:07:10.438
So here I've called this
intermediate variable after

00:07:10.438 --> 00:07:14.997
the plus node q, and we
have q equals x plus y,

00:07:14.997 --> 00:07:18.609
and then f equals q times z,
using this intermediate node.

00:07:18.609 --> 00:07:21.341
And I've also written
out here, the gradients

00:07:21.341 --> 00:07:24.763
of q with respect to x
and y, which are just one

00:07:24.763 --> 00:07:28.131
because of the addition,
and then the gradients of f

00:07:28.131 --> 00:07:31.653
with respect to q and z,
which is z and q respectively

00:07:31.653 --> 00:07:34.607
because of the multiplication rule.

00:07:34.607 --> 00:07:36.598
And so what we want to
find, is we want to find

00:07:36.598 --> 00:07:40.431
the gradients of f with
respect to x, y and z.

00:07:43.356 --> 00:07:46.822
So what backprop is, it's
a recursive application of

00:07:46.822 --> 00:07:49.182
the chain rule, so we're
going to start at the back,

00:07:49.182 --> 00:07:51.311
the very end of the computational graph,

00:07:51.311 --> 00:07:53.360
and then we're going to
work our way backwards

00:07:53.360 --> 00:07:56.373
and compute all the
gradients along the way.

00:07:56.373 --> 00:07:59.015
So here if we start at
the very end, right,

00:07:59.015 --> 00:08:01.217
we want to compute the
gradient of the output

00:08:01.217 --> 00:08:04.674
with respect to the last
variable, which is just f.

00:08:04.674 --> 00:08:08.591
And so this gradient is
just one, it's trivial.

00:08:10.065 --> 00:08:12.604
So now, moving backwards,
we want the gradient

00:08:12.604 --> 00:08:15.687
with respect to z, right, and we know

00:08:16.637 --> 00:08:19.137
that df over dz is equal to q.

00:08:19.993 --> 00:08:22.636
So the value of q is just three,

00:08:22.636 --> 00:08:26.386
and so we have here, df
over dz equals three.

00:08:28.819 --> 00:08:31.688
And so next if we want to do df over dq,

00:08:31.688 --> 00:08:33.855
what is the value of that?

00:08:36.732 --> 00:08:37.957
What is df over dq?

00:08:37.957 --> 00:08:42.040
So we have here, df over
dq is equal to z, right,

00:08:46.012 --> 00:08:49.012
and the value of z is negative four.

00:08:49.963 --> 00:08:54.130
So here we have df over dq
is equal to negative four.

00:08:57.485 --> 00:09:00.802
Okay, so now continuing to
move backwards to the graph,

00:09:00.802 --> 00:09:03.793
we want to find df over dy, right,

00:09:03.793 --> 00:09:06.251
but here in this case, the
gradient with respect to y,

00:09:06.251 --> 00:09:08.986
y is not connected directly to f, right?

00:09:08.986 --> 00:09:12.838
It's connected through an
intermediate node, q,

00:09:12.838 --> 00:09:14.756
and so the way we're going to do this

00:09:14.756 --> 00:09:17.998
is we can leverage the
chain rule which says

00:09:17.998 --> 00:09:21.748
that df over dy can be
written as df over dq,

00:09:22.746 --> 00:09:26.754
times dq over dy, and
so the intuition of this

00:09:26.754 --> 00:09:30.707
is that in order to get to
find the effect of y on f,

00:09:30.707 --> 00:09:33.348
this is actually equivalent to if we take

00:09:33.348 --> 00:09:37.416
the effect of q on f,
which we already know, right?

00:09:37.416 --> 00:09:41.330
df over dq is equal to negative four,

00:09:41.330 --> 00:09:45.497
and we compound it with the
effect of y on q, dq over dy.

00:09:46.604 --> 00:09:50.986
So what's dq over dy
equal to in this case?

00:09:50.986 --> 00:09:51.819
- [Student] One.

00:09:51.819 --> 00:09:52.980
- One, right. Exactly.

00:09:52.980 --> 00:09:55.916
So dq over dy is equal to
one, which means, you know,

00:09:55.916 --> 00:09:57.984
if we change y by a little bit,

00:09:57.984 --> 00:09:59.408
q is going to change by approximately

00:09:59.408 --> 00:10:01.627
the same amount right, this is the effect,

00:10:01.627 --> 00:10:05.585
and so what this is
doing is this is saying,

00:10:05.585 --> 00:10:07.474
well if I change y by a little bit,

00:10:07.474 --> 00:10:10.807
the effect of y on q is going to be one,

00:10:13.249 --> 00:10:16.882
and then the effect of q on f
is going to be approximately

00:10:16.882 --> 00:10:18.628
a factor of negative four, right?

00:10:18.628 --> 00:10:20.405
So then we multiply these together

00:10:20.405 --> 00:10:24.053
and we get that the effect of y on f

00:10:24.053 --> 00:10:26.470
is going to be negative four.

00:10:30.887 --> 00:10:33.249
Okay, so now if we want
to do the same thing for

00:10:33.249 --> 00:10:35.632
the gradient with respect to x, right,

00:10:35.632 --> 00:10:38.074
we can do the, we can
follow the same procedure,

00:10:38.074 --> 00:10:41.191
and so what is this going to be?

00:10:41.191 --> 00:10:42.711
[students speaking away from microphone]

00:10:42.711 --> 00:10:44.746
- I heard the same.

00:10:44.746 --> 00:10:48.746
Yeah exactly, so in this
case we want to, again,

00:10:49.636 --> 00:10:51.051
apply the chain rule, right?

00:10:51.051 --> 00:10:55.882
We know the effect of q on
f is negative four,

00:10:55.882 --> 00:10:59.167
and here again, since we have
also the same addition node,

00:10:59.167 --> 00:11:01.958
dq over dx is equal to one, again,

00:11:01.958 --> 00:11:04.593
we have negative four times
one, right, and the gradient

00:11:04.593 --> 00:11:08.760
with respect to x is
going to be negative four.
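
NOTE A minimal sketch of this forward and backward pass in code, using the same values from the slide (x = -2, y = 5, z = -4):
    x, y, z = -2.0, 5.0, -4.0
    # forward pass
    q = x + y                   # 3
    f = q * z                   # -12
    # backward pass, from the output back to the inputs
    df_df = 1.0                 # gradient of f with respect to itself
    df_dz = q * df_df           # 3
    df_dq = z * df_df           # -4
    df_dy = 1.0 * df_dq         # dq/dy = 1, so -4
    df_dx = 1.0 * df_dq         # dq/dx = 1, so -4
    print(df_dx, df_dy, df_dz)  # -4.0 -4.0 3.0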

00:11:11.467 --> 00:11:13.214
Okay, so what we're doing is, in backprop,

00:11:13.214 --> 00:11:15.389
is we basically have all of these nodes

00:11:15.389 --> 00:11:17.846
in our computational graph, but each node

00:11:17.846 --> 00:11:20.724
is only aware of its
immediate surroundings, right?

00:11:20.724 --> 00:11:23.874
So we have, at each node,
we have the local inputs

00:11:23.874 --> 00:11:25.579
that are connected to this node,

00:11:25.579 --> 00:11:27.312
the values that are flowing into the node,

00:11:27.312 --> 00:11:28.981
and then we also have the output

00:11:28.981 --> 00:11:32.432
that is directly outputted from this node.

00:11:32.432 --> 00:11:36.599
So here our local inputs are
x and y, and the output is z.

00:11:39.777 --> 00:11:43.469
And at this node we also know
the local gradient, right,

00:11:43.469 --> 00:11:46.607
we can compute the gradient
of z with respect to x,

00:11:46.607 --> 00:11:48.905
and the gradient of z with respect to y,

00:11:48.905 --> 00:11:51.217
and these are usually really
simple operations, right?

00:11:51.217 --> 00:11:53.115
Each node is going to be something like

00:11:53.115 --> 00:11:54.821
the addition or the multiplication

00:11:54.821 --> 00:11:56.786
that we had in that earlier example,

00:11:56.786 --> 00:11:58.276
which is something where
we can just write down

00:11:58.276 --> 00:12:00.158
the gradient, and we
don't have to, you know,

00:12:00.158 --> 00:12:04.665
go through very complex
calculus in order to find this.

00:12:04.665 --> 00:12:07.513
- [Student] Can you go
back and explain why

00:12:07.513 --> 00:12:11.699
what we did in the last slide was
different than finding

00:12:11.699 --> 00:12:15.313
the first part of it using
just normal calculus?

00:12:15.313 --> 00:12:18.683
- Yeah, so basically if we go back,

00:12:18.683 --> 00:12:20.183
hold on, let me...

00:12:21.740 --> 00:12:24.472
So if we go back here, we
could exactly write out,

00:12:24.472 --> 00:12:26.565
find all of these using just calculus,

00:12:26.565 --> 00:12:30.216
so we could say, you know,
we want df over dx, right,

00:12:30.216 --> 00:12:32.572
and we can probably
expand out this expression

00:12:32.572 --> 00:12:35.338
and see that it's just going to be z,

00:12:35.338 --> 00:12:37.056
but we can do this for, in this case,

00:12:37.056 --> 00:12:39.526
because it's simple, but
we'll see examples later on

00:12:39.526 --> 00:12:42.667
where once this becomes a
really complicated expression,

00:12:42.667 --> 00:12:44.916
you don't want to have to use calculus

00:12:44.916 --> 00:12:47.487
to derive, right, the
gradient for something,

00:12:47.487 --> 00:12:49.302
for a super-complicated expression,

00:12:49.302 --> 00:12:52.094
and instead, if you use this formalism

00:12:52.094 --> 00:12:55.018
and you break it down into
these computational nodes,

00:12:55.018 --> 00:12:58.001
then you can only ever work with gradients

00:12:58.001 --> 00:13:01.612
of very simple computations, right,

00:13:01.612 --> 00:13:04.925
at the level of, you know,
additions, multiplications,

00:13:04.925 --> 00:13:06.938
exponentials, things as
simple as you want them,

00:13:06.938 --> 00:13:08.743
and then you just use the chain rule

00:13:08.743 --> 00:13:10.038
to multiply all these together,

00:13:10.038 --> 00:13:12.404
and get your, the value of your gradient

00:13:12.404 --> 00:13:16.571
without having to ever
derive the entire expression.

00:13:18.562 --> 00:13:20.508
Does that make sense?

00:13:20.508 --> 00:13:21.367
[student murmuring]

00:13:21.367 --> 00:13:25.034
Okay, so we'll see an
example of this later.

00:13:28.751 --> 00:13:30.449
And so, was there another question, yeah?

00:13:30.449 --> 00:13:32.240
[student speaking away from microphone]

00:13:32.240 --> 00:13:33.778
- [Student] What's the negative four

00:13:33.778 --> 00:13:36.146
next to the z representing?

00:13:36.146 --> 00:13:38.408
- Negative, okay yeah,
so the negative four,

00:13:38.408 --> 00:13:39.884
these were the, the green values on top

00:13:39.884 --> 00:13:43.831
were all the values of
the function as we passed

00:13:43.831 --> 00:13:46.623
it forward through the
computational graph, right?

00:13:46.623 --> 00:13:49.835
So we said up here that x
is equal to negative two,

00:13:49.835 --> 00:13:52.591
y is equal to five, and
z equals negative four,

00:13:52.591 --> 00:13:55.621
so we filled in all of these
values, and then we just wanted

00:13:55.621 --> 00:13:59.286
to compute the value of this function.

00:13:59.286 --> 00:14:04.008
Right, so we said this value
of q is going to be x plus y,

00:14:04.008 --> 00:14:06.118
it's going to be negative
two plus five, it is going

00:14:06.118 --> 00:14:09.289
to be three, and we have z
is equal to negative four

00:14:09.289 --> 00:14:12.285
so we fill that in here,
and then we multiplied q

00:14:12.285 --> 00:14:14.192
and z together, negative four times three

00:14:14.192 --> 00:14:16.886
in order to get the
final value of f, right?

00:14:16.886 --> 00:14:18.175
And then the red values underneath

00:14:18.175 --> 00:14:20.540
were as we were filling in the gradients

00:14:20.540 --> 00:14:22.957
as we were working backwards.

00:14:24.927 --> 00:14:25.760
Okay.

00:14:29.418 --> 00:14:33.356
Okay, so right, so we said that, you know,

00:14:33.356 --> 00:14:34.845
we have these local, these nodes,

00:14:34.845 --> 00:14:38.969
and each node basically gets
its local inputs coming in

00:14:38.969 --> 00:14:40.886
and the output that it
sees directly passing on

00:14:40.886 --> 00:14:44.222
to the next node, and we also
have these local gradients

00:14:44.222 --> 00:14:46.302
that we computed, right, the gradient of

00:14:46.302 --> 00:14:48.476
the immediate output of the node

00:14:48.476 --> 00:14:51.175
with respect to the inputs coming in.

00:14:51.175 --> 00:14:55.155
And so what happens during
backprop is we have these,

00:14:55.155 --> 00:14:56.750
we'll start from the
back of the graph, right,

00:14:56.750 --> 00:14:58.471
and then we work our way from the end

00:14:58.471 --> 00:15:00.341
all the way back to the beginning,

00:15:00.341 --> 00:15:03.116
and when we reach each
node, at each node we have

00:15:03.116 --> 00:15:05.593
the upstream gradients coming back, right,

00:15:05.593 --> 00:15:08.980
with respect to the
immediate output of the node.

00:15:08.980 --> 00:15:11.626
So by the time we reach
this node in backprop,

00:15:11.626 --> 00:15:13.503
we've already computed the gradient

00:15:13.503 --> 00:15:17.808
of our final loss l,
with respect to z, right?

00:15:17.808 --> 00:15:20.310
And so now what we want to find next

00:15:20.310 --> 00:15:22.508
is we want to find the
gradients with respect

00:15:22.508 --> 00:15:26.675
to just before the node,
to the values of x and y.

00:15:27.679 --> 00:15:30.962
And so as we saw earlier, we
do this using the chain rule,

00:15:30.962 --> 00:15:33.053
right, we have from the chain rule,

00:15:33.053 --> 00:15:34.857
that the gradient of this loss function

00:15:34.857 --> 00:15:36.690
with respect to x is going to be

00:15:36.690 --> 00:15:41.482
the gradient with respect
to z times, compounded by

00:15:41.482 --> 00:15:44.987
this gradient, local gradient
of z with respect to x.

00:15:44.987 --> 00:15:46.369
Right, so in the chain rule we always take

00:15:46.369 --> 00:15:48.422
this upstream gradient coming down,

00:15:48.422 --> 00:15:50.650
and we multiply it by the local gradient

00:15:50.650 --> 00:15:55.071
in order to get the gradient
with respect to the input.

00:15:55.071 --> 00:15:57.349
- [Student] So, sorry, is it,

00:15:57.349 --> 00:15:59.731
it's different because
this would never work

00:15:59.731 --> 00:16:01.949
to get a general formula into the,

00:16:01.949 --> 00:16:04.478
or general symbolic
formula for the gradient.

00:16:04.478 --> 00:16:06.700
It only works with instantaneous values,

00:16:06.700 --> 00:16:07.533
where you like.

00:16:07.533 --> 00:16:08.423
[student coughing]

00:16:08.423 --> 00:16:11.896
Or passing a little constant
value rather than a symbolic one.

00:16:11.896 --> 00:16:16.335
- So the question is
whether this only works

00:16:16.335 --> 00:16:19.496
because we're working
with the current values of

00:16:19.496 --> 00:16:22.579
the function, and so it works, right,

00:16:23.842 --> 00:16:25.745
given the current values of
the function that we plug in,

00:16:25.745 --> 00:16:27.611
but we can write an expression for this,

00:16:27.611 --> 00:16:29.819
still in terms of the variables, right?

00:16:29.819 --> 00:16:33.635
So we'll see that gradient
of L with respect to z

00:16:33.635 --> 00:16:36.357
is going to be some
expression, and gradient of z

00:16:36.357 --> 00:16:39.463
with respect to x is going to
be another expression, right?

00:16:39.463 --> 00:16:43.422
But we plug in these,
we plug in the values

00:16:43.422 --> 00:16:45.481
of these numbers at the
time in order to get

00:16:45.481 --> 00:16:48.708
the value of the gradient
with respect to x.

00:16:48.708 --> 00:16:51.958
So what you could do is you
could recursively plug in

00:16:51.958 --> 00:16:55.203
all of these expressions, right?

00:16:55.203 --> 00:16:58.239
Gradient with respect, z with respect to x

00:16:58.239 --> 00:17:01.442
is going to be a simple,
simple expression, right?

00:17:01.442 --> 00:17:03.708
So in this case, if we
have a multiplication node,

00:17:03.708 --> 00:17:05.739
gradient of z with
respect to x is just going

00:17:05.739 --> 00:17:08.612
to be y, right, we know that,

00:17:08.612 --> 00:17:10.898
but the gradient of L with respect to z,

00:17:10.898 --> 00:17:12.811
this is probably a complex part of

00:17:12.811 --> 00:17:17.354
the graph in itself, right, so
here's where we want to just,

00:17:17.355 --> 00:17:20.748
in this case, have this numerical, right?

00:17:20.748 --> 00:17:23.105
So as you said, basically
this is going to be just

00:17:23.105 --> 00:17:25.423
a number coming down, right, a value,

00:17:25.423 --> 00:17:26.540
and then we just multiply it with

00:17:26.540 --> 00:17:30.719
the expression that we have
for the local gradient.

00:17:30.719 --> 00:17:32.064
And I think this will be
more clear when we go through

00:17:32.064 --> 00:17:35.647
a more complicated
example in a few slides.

00:17:38.225 --> 00:17:40.685
Okay, so now the gradient
of L with respect to y,

00:17:40.685 --> 00:17:44.284
we have exactly the
same idea, where again,

00:17:44.284 --> 00:17:45.966
we use the chain rule,
we have gradient of L

00:17:45.966 --> 00:17:48.169
with respect to z, times the gradient of z

00:17:48.169 --> 00:17:50.661
with respect to y, right,
we use the chain rule,

00:17:50.661 --> 00:17:54.411
multiply these together
and get our gradient.

00:17:55.848 --> 00:17:57.910
And then once we have these,
we'll pass these on to

00:17:57.910 --> 00:18:00.816
the node directly before,
or connected to this node.

00:18:00.816 --> 00:18:03.484
And so the main thing
to take away from this

00:18:03.484 --> 00:18:06.920
is that at each node we just
want to have our local gradient

00:18:06.920 --> 00:18:09.047
that we compute, just keep track of this,

00:18:09.047 --> 00:18:12.282
and then during backprop as
we're receiving, you know,

00:18:12.282 --> 00:18:16.127
numerical values of gradients
coming from upstream,

00:18:16.127 --> 00:18:18.271
we just take what that is, multiply it by

00:18:18.271 --> 00:18:20.077
the local gradient, and then this is

00:18:20.077 --> 00:18:23.741
what we then send back
to the connected nodes,

00:18:23.741 --> 00:18:27.175
the next nodes going backwards,
without having to care

00:18:27.175 --> 00:18:31.664
about anything else besides
these immediate surroundings.
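
NOTE One way to picture "each node only knows its local inputs, output, and local gradient" in code; this gate class is a hypothetical illustration, not the assignment's API:
    class MultiplyGate:
        def forward(self, x, y):
            self.x, self.y = x, y   # cache inputs: they are the local gradients
            return x * y
        def backward(self, upstream):
            dx = upstream * self.y  # chain rule: upstream times local gradient
            dy = upstream * self.x
            return dx, dy
    gate = MultiplyGate()
    out = gate.forward(3.0, -4.0)   # forward: -12
    dx, dy = gate.backward(1.0)     # backward with upstream gradient 1
    print(out, dx, dy)              # -12.0 -4.0 3.0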

00:18:31.664 --> 00:18:33.795
So now we're going to go
through another example,

00:18:33.795 --> 00:18:35.353
this time a little bit more complex,

00:18:35.353 --> 00:18:39.239
so we can see more why
backprop is so useful.

00:18:39.239 --> 00:18:43.893
So in this case, our
function is f of w and x,

00:18:43.893 --> 00:18:46.626
which is equal to one over one plus e

00:18:46.626 --> 00:18:49.576
to the negative of w-zero times x-zero

00:18:49.576 --> 00:18:52.619
plus w-one x-one, plus w-two, right?

00:18:52.619 --> 00:18:54.978
So again, the first step always is we want

00:18:54.978 --> 00:18:57.525
to write this out as
a computational graph.

00:18:57.525 --> 00:18:59.798
So in this case we can see
that in this graph, right,

00:18:59.798 --> 00:19:02.863
first we multiply together the
w and x terms that we have,

00:19:02.863 --> 00:19:06.697
w-zero with x-zero, w-one with x-one,

00:19:06.697 --> 00:19:10.388
and w-two, then we add all
of these together, right?

00:19:10.388 --> 00:19:13.471
Then we do, scale it by negative one,

00:19:14.525 --> 00:19:17.372
we take the exponential, we add one,

00:19:17.372 --> 00:19:21.372
and then finally we do
one over this whole term.

00:19:22.533 --> 00:19:25.170
And then here I've also
filled in values of these,

00:19:25.170 --> 00:19:27.581
so let's say given values that we have

00:19:27.581 --> 00:19:30.726
for the ws and xs, right,
we can make a forward pass

00:19:30.726 --> 00:19:32.338
and basically compute what the value is

00:19:32.338 --> 00:19:35.171
at every stage of the computation.

00:19:37.091 --> 00:19:40.056
And here I've also written
down here at the bottom

00:19:40.056 --> 00:19:42.986
the values, the expressions
for some derivatives

00:19:42.986 --> 00:19:44.427
that are going to be helpful later on,

00:19:44.427 --> 00:19:49.339
so same as we did before
with the simple example.

00:19:49.339 --> 00:19:51.820
Okay, so now then we're going
to do backprop through here,

00:19:51.820 --> 00:19:53.327
right, so again, we're going to start at

00:19:53.327 --> 00:19:56.654
the very end of the
graph, and so here again

00:19:56.654 --> 00:20:00.736
the gradient of the output with
respect to the last variable

00:20:00.736 --> 00:20:04.074
is just one, it's just trivial,

00:20:04.074 --> 00:20:07.323
and so now moving
backwards one step, right?

00:20:07.323 --> 00:20:10.768
So what's the gradient with respect to

00:20:10.768 --> 00:20:13.405
the input just before one over x?

00:20:13.405 --> 00:20:18.250
Well, so in this case, we know
that the upstream gradient

00:20:18.250 --> 00:20:21.592
that we have coming down,
right, is this red one, right?

00:20:21.592 --> 00:20:24.153
This is the upstream gradient
that we have flowing down,

00:20:24.153 --> 00:20:26.938
and then now we need to find
the local gradient, right,

00:20:26.938 --> 00:20:28.358
and the local gradient of this node,

00:20:28.358 --> 00:20:30.181
this node is one over x, right,

00:20:30.181 --> 00:20:33.117
so we have f of x equals
one over x here in red,

00:20:33.117 --> 00:20:35.871
and the local gradient of this df over dx

00:20:35.871 --> 00:20:39.935
is equal to negative one
over x-squared, right?

00:20:39.935 --> 00:20:43.700
So here we're going to take
negative one over x-squared,

00:20:43.700 --> 00:20:45.845
and plug in the value
of x that we had during

00:20:45.845 --> 00:20:50.012
this forward pass, 1.37,
and so our final gradient

00:20:51.325 --> 00:20:52.805
with respect to this variable is going

00:20:52.805 --> 00:20:56.638
to be negative one over
1.37 squared times one

00:20:58.184 --> 00:20:59.934
equals negative 0.53.

00:21:04.382 --> 00:21:06.769
So moving back to the next node,

00:21:06.769 --> 00:21:09.023
we're going to go through the
exact same process, right?

00:21:09.023 --> 00:21:12.868
So here, the gradient
flowing from upstream

00:21:12.868 --> 00:21:16.007
is going to be negative 0.53, right,

00:21:16.007 --> 00:21:20.365
and here the local gradient,
the node here is a plus one,

00:21:20.365 --> 00:21:25.203
and so now looking at our
reference of derivatives at

00:21:25.203 --> 00:21:29.287
the bottom, we have that
for a constant plus x,

00:21:29.287 --> 00:21:31.729
the local gradient is just one, right?

00:21:31.729 --> 00:21:34.209
So what's the gradient with respect

00:21:34.209 --> 00:21:37.376
to this variable using the chain rule?

00:21:42.883 --> 00:21:44.842
So it's going to be the upstream gradient

00:21:44.842 --> 00:21:48.925
of negative 0.53 times
our local gradient of one,

00:21:50.021 --> 00:21:52.688
which is equal to negative 0.53.

00:21:55.849 --> 00:21:59.604
So let's keep moving
backwards one more step.

00:21:59.604 --> 00:22:02.098
So here we have the exponential, right?

00:22:02.098 --> 00:22:05.022
So what's the upstream
gradient coming down?

00:22:05.022 --> 00:22:08.536
[student speaking away from microphone]

00:22:08.536 --> 00:22:11.775
Right, so the upstream
gradient is negative 0.53,

00:22:11.775 --> 00:22:14.358
what's the local gradient here?

00:22:15.402 --> 00:22:18.002
It's going to be the local
gradient of e to the x, right?

00:22:18.002 --> 00:22:22.169
This is an exponential
node, and so our chain rule

00:22:23.590 --> 00:22:25.555
is going to tell us that our gradient

00:22:25.555 --> 00:22:29.722
is going to be negative 0.53
times e to the power of x,

00:22:30.869 --> 00:22:32.495
which in this case is negative one,

00:22:32.495 --> 00:22:34.670
from our forward pass, and
this is going to give us

00:22:34.670 --> 00:22:37.587
our final gradient of negative 0.2.

00:22:40.215 --> 00:22:42.882
Okay, so now one more node here,

00:22:44.234 --> 00:22:46.706
the next node is, that
we reach, is going to be

00:22:46.706 --> 00:22:48.912
a multiplication with negative one, right?

00:22:48.912 --> 00:22:52.729
So here, what's the upstream
gradient coming down?

00:22:52.729 --> 00:22:54.090
- [Student] Negative 0.2?

00:22:54.090 --> 00:22:56.565
- [Serena] Negative 0.2,
right, and what's going to be

00:22:56.565 --> 00:23:01.510
the local gradient? You can
look at the reference sheet.

00:23:01.510 --> 00:23:03.049
It's going to be, what was it?

00:23:03.049 --> 00:23:03.889
I think I heard it.

00:23:03.889 --> 00:23:05.205
- [Student] That's minus one?

00:23:05.205 --> 00:23:09.680
- It's going to be minus
one, exactly, yeah,

00:23:09.680 --> 00:23:13.282
because our local gradient
says it's going to be,

00:23:13.282 --> 00:23:15.976
df over dx is a, right, and the value of a

00:23:15.976 --> 00:23:19.220
that we scaled x by is negative one here.

00:23:19.220 --> 00:23:21.806
So we have here that the gradient

00:23:21.806 --> 00:23:24.575
is negative one times negative 0.2,

00:23:24.575 --> 00:23:26.825
and so our gradient is 0.2.

00:23:29.169 --> 00:23:32.925
Okay, so now we've
reached an addition node,

00:23:32.925 --> 00:23:35.816
and so in this case we
have these two branches

00:23:35.816 --> 00:23:37.269
both connected to it, right?

00:23:37.269 --> 00:23:39.288
So what's the upstream gradient here?

00:23:39.288 --> 00:23:43.286
It's going to be 0.2, right,
just as everything else,

00:23:43.286 --> 00:23:46.186
and here now the gradient with respect

00:23:46.186 --> 00:23:50.122
to each of these branches,
it's an addition, right,

00:23:50.122 --> 00:23:52.586
and we saw from before
in our simple example

00:23:52.586 --> 00:23:54.199
that when we have an addition node,

00:23:54.199 --> 00:23:56.226
the gradient with respect
to each of the inputs

00:23:56.226 --> 00:23:59.266
to the addition is just
going to be one, right?

00:23:59.266 --> 00:24:03.661
So here, our local gradient
for our top branch

00:24:03.661 --> 00:24:06.406
is going to be one times
the upstream gradient

00:24:06.406 --> 00:24:10.573
of 0.2, which is going to give
a total gradient of 0.2, right?

00:24:12.016 --> 00:24:14.219
And then we, for our bottom branch we'd do

00:24:14.219 --> 00:24:18.400
the same thing, right, our
upstream gradient is 0.2,

00:24:18.400 --> 00:24:20.269
our local gradient is one again,

00:24:20.269 --> 00:24:23.277
and the total gradient is 0.2.

00:24:23.277 --> 00:24:26.110
So is everything clear about this?

00:24:27.581 --> 00:24:28.414
Okay.

00:24:30.649 --> 00:24:32.758
So we have a few more
gradients to fill out,

00:24:32.758 --> 00:24:37.648
so moving back now we've
reached w-zero and x-zero,

00:24:37.648 --> 00:24:41.086
and so here we have a
multiplication node, right,

00:24:41.086 --> 00:24:44.169
so we saw the multiplication
node from before,

00:24:44.169 --> 00:24:45.698
it just, the gradient with respect

00:24:45.698 --> 00:24:49.506
to one of the inputs just is
the value of the other input.

00:24:49.506 --> 00:24:51.171
And so in this case, what's the gradient

00:24:51.171 --> 00:24:53.088
with respect to w-zero?

00:24:56.927 --> 00:24:58.795
- [Student] Minus 0.2.

00:24:58.795 --> 00:25:02.599
- Minus, I'm hearing minus 0.2, exactly.

00:25:02.599 --> 00:25:04.366
Yeah, so with respect to w-zero,

00:25:04.366 --> 00:25:07.866
we have our upstream gradient, 0.2, right,

00:25:09.747 --> 00:25:11.481
times our, this is the bottom one,

00:25:11.481 --> 00:25:13.781
times our value of x,
which is negative one,

00:25:13.781 --> 00:25:16.472
we get negative 0.2 and
we can do the same thing

00:25:16.472 --> 00:25:18.800
for our gradient with respect to x-zero.

00:25:18.800 --> 00:25:22.008
It's going to be 0.2
times the value of w-zero

00:25:22.008 --> 00:25:24.425
which is two, and we get 0.4.

00:25:26.525 --> 00:25:30.692
Okay, so here we've filled
out most of these gradients,

00:25:32.200 --> 00:25:35.805
and so there was the question earlier

00:25:35.805 --> 00:25:40.128
about why this is simpler
than just computing,

00:25:40.128 --> 00:25:43.071
deriving the analytic gradient,
the expression with respect

00:25:43.071 --> 00:25:44.558
to any of these variables, right?

00:25:44.558 --> 00:25:47.166
And so you can see here,
all we ever dealt with

00:25:47.166 --> 00:25:50.164
was expressions for local gradients

00:25:50.164 --> 00:25:52.165
that we had to write out, so
once we had these expressions

00:25:52.165 --> 00:25:54.034
for local gradients, all we did

00:25:54.034 --> 00:25:56.850
was plug in the values for
each of these that we have,

00:25:56.850 --> 00:25:59.656
and use the chain rule to
numerically multiply this all

00:25:59.656 --> 00:26:01.532
the way backwards and get the gradients

00:26:01.532 --> 00:26:04.615
with respect to all of the variables.
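
NOTE The backward pass just walked through, replayed as a short script. The only values needed are the ones stated on the slide: the sum feeding the *(-1) node is 1 (so the input to exp is -1), w0 = 2, and x0 = -1.
    import math
    w0, x0 = 2.0, -1.0
    s = 1.0                                # w0*x0 + w1*x1 + w2, from the forward pass
    e = math.exp(-s)                       # ~0.37
    denom = 1.0 + e                        # ~1.37
    f = 1.0 / denom                        # ~0.73, the final output
    # backward: upstream gradient times each local gradient, node by node
    d_denom = (-1.0 / denom ** 2) * 1.0    # 1/x node: ~ -0.53
    d_e     = 1.0 * d_denom                # +1 node:  ~ -0.53
    d_negs  = math.exp(-s) * d_e           # exp node: ~ -0.2
    d_s     = -1.0 * d_negs                # *(-1) node: ~ 0.2
    d_w0x0  = 1.0 * d_s                    # add node distributes: ~ 0.2
    d_w0    = x0 * d_w0x0                  # ~ -0.2
    d_x0    = w0 * d_w0x0                  # ~  0.4
    print(round(d_w0, 1), round(d_x0, 1))  # -0.2 0.4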

00:26:08.779 --> 00:26:11.644
And so, you know, we can also fill out

00:26:11.644 --> 00:26:14.490
the gradients with respect
to w-one and x-one here

00:26:14.490 --> 00:26:16.535
in exactly the same way, and so one thing

00:26:16.535 --> 00:26:19.763
that I want to note is that
right when we're creating

00:26:19.763 --> 00:26:21.996
these computational graphs, we can define

00:26:21.996 --> 00:26:25.598
the computational nodes at any
granularity that we want to.

00:26:25.598 --> 00:26:27.896
So in this case, we broke it down into

00:26:27.896 --> 00:26:29.813
the absolute simplest
that we could, right,

00:26:29.813 --> 00:26:33.320
we broke it down into
additions and multiplications,

00:26:33.320 --> 00:26:36.340
you know, it basically can't
get any simpler than that,

00:26:36.340 --> 00:26:38.883
but in practice, right,
we can group some of

00:26:38.883 --> 00:26:42.706
these nodes together into
more complex nodes if we want.

00:26:42.706 --> 00:26:44.600
As long as we're able to write down

00:26:44.600 --> 00:26:47.245
the local gradient for that node, right?

00:26:47.245 --> 00:26:51.412
And so as an example, if we
look at a sigmoid function,

00:26:53.153 --> 00:26:55.691
so I've defined the sigmoid function in

00:26:55.691 --> 00:26:57.861
the upper-right here, of a sigmoid of x

00:26:57.861 --> 00:27:00.854
is equal to one over one
plus e to the negative x,

00:27:00.854 --> 00:27:03.404
and this is something that's
a really common function

00:27:03.404 --> 00:27:05.996
that you'll see a lot in
the rest of this class,

00:27:05.996 --> 00:27:09.904
and we can compute the gradient for this,

00:27:09.904 --> 00:27:12.686
we can write it out, and if
we do actually go through

00:27:12.686 --> 00:27:16.427
the math of doing this analytically,

00:27:16.427 --> 00:27:18.173
we can get a nice expression at the end.

00:27:18.173 --> 00:27:22.598
So in this case it's equal
to one minus sigma of x,

00:27:22.598 --> 00:27:26.609
so the output of this function
times sigma of x, right?

00:27:26.609 --> 00:27:29.534
And so in cases where we
have something like this,

00:27:29.534 --> 00:27:33.210
we could just take all the computations

00:27:33.210 --> 00:27:35.730
that we had in our graph
that made up this sigmoid,

00:27:35.730 --> 00:27:37.038
and we could just replace it

00:27:37.038 --> 00:27:39.837
with one big node that's a sigmoid, right,

00:27:39.837 --> 00:27:43.655
because we do know the local
gradient for this gate,

00:27:43.655 --> 00:27:48.564
it's this expression, d of the
sigmoid of x over dx, right?

00:27:48.564 --> 00:27:51.260
So basically the important thing here is

00:27:51.260 --> 00:27:54.795
that you can group any nodes that you want

00:27:54.795 --> 00:27:58.791
to make any sorts of a little
bit more complex nodes,

00:27:58.791 --> 00:28:01.645
as long as you can write down
the local gradient for this.

00:28:01.645 --> 00:28:04.268
And so all this is is
basically a trade-off between,

00:28:04.268 --> 00:28:06.433
you know, how much math
that you want to do

00:28:06.433 --> 00:28:11.043
in order to get a more, kind
of concise and simpler graph,

00:28:11.043 --> 00:28:13.793
right, versus how simple you want

00:28:14.913 --> 00:28:16.849
each of your gradients to be, right?

00:28:16.849 --> 00:28:19.290
And then you can write out as complex of

00:28:19.290 --> 00:28:21.616
a computational graph that you want.

00:28:21.616 --> 00:28:23.358
Yeah, question?

00:28:23.358 --> 00:28:25.278
- [Student] This is a
question on the graph itself,

00:28:25.278 --> 00:28:28.149
is there a reason that the
first two multiplication nodes

00:28:28.149 --> 00:28:32.055
and the weights are not connected
to a single addition node?

00:28:32.055 --> 00:28:33.486
- So they could also be connected into

00:28:33.486 --> 00:28:36.771
a single addition node,
so the question was,

00:28:36.771 --> 00:28:39.835
is there a reason why w-zero and x-zero

00:28:39.835 --> 00:28:41.868
are not connected with w-two?

00:28:41.868 --> 00:28:44.167
All of these additions
just connected together,

00:28:44.167 --> 00:28:46.179
and yeah, so the reason, the answer is

00:28:46.179 --> 00:28:48.432
that you can do that if you want,

00:28:48.432 --> 00:28:50.479
and in practice, maybe you
would actually want to do that

00:28:50.479 --> 00:28:52.821
because this is still a
very simple node, right?

00:28:52.821 --> 00:28:56.658
So in this case I just wrote
this out into as simple

00:28:56.658 --> 00:29:01.414
as possible, where each node
only had up to two inputs,

00:29:01.414 --> 00:29:04.499
but yeah, you could definitely do that.

00:29:04.499 --> 00:29:07.082
Any other questions about this?

00:29:08.682 --> 00:29:11.399
Okay, so the one thing that I really like

00:29:11.399 --> 00:29:13.681
about thinking about this
like a computational graph

00:29:13.681 --> 00:29:15.406
is that I feel very comforted, right,

00:29:15.406 --> 00:29:17.934
like anytime I have to take a gradient,

00:29:17.934 --> 00:29:20.253
find gradients of something,
even if the expression

00:29:20.253 --> 00:29:22.960
that I want to compute
gradients of is really hairy,

00:29:22.960 --> 00:29:24.950
and really scary, you know,
whether it's something like

00:29:24.950 --> 00:29:26.939
this sigmoid or something worse,

00:29:26.939 --> 00:29:30.702
I know that, you know, I could
derive this if I want to,

00:29:30.702 --> 00:29:33.214
but really, if I just
sit down and write it out

00:29:33.214 --> 00:29:35.024
in terms of a computational graph,

00:29:35.024 --> 00:29:37.033
I can go as simple as I need to

00:29:37.033 --> 00:29:40.405
to always be able to apply
backprop and the chain rule,

00:29:40.405 --> 00:29:44.058
and be able to compute all
the gradients that I need.

00:29:44.058 --> 00:29:46.623
And so this is something that
you guys should think about

00:29:46.623 --> 00:29:51.204
when you're doing your homeworks,
as basically, you know,

00:29:51.204 --> 00:29:53.438
anytime you're having trouble
finding gradients of something

00:29:53.438 --> 00:29:55.474
just think about it as
a computational graph,

00:29:55.474 --> 00:29:57.285
break it down into all of these parts,

00:29:57.285 --> 00:29:59.618
and then use the chain rule.

00:30:00.884 --> 00:30:04.040
Okay, and so, you know, so we talked about

00:30:04.040 --> 00:30:06.747
how we could group these
set of nodes together

00:30:06.747 --> 00:30:10.558
into a sigmoid gate, and
just to confirm, like,

00:30:10.558 --> 00:30:12.465
that this is actually exactly equivalent,

00:30:12.465 --> 00:30:14.098
we can plug this in, right?

00:30:14.098 --> 00:30:18.182
So we have that our input
here to the sigmoid gate

00:30:18.182 --> 00:30:21.717
is going to be one, in
green, and then we have

00:30:21.717 --> 00:30:25.800
that the output is going
to be here, 0.73, right,

00:30:26.745 --> 00:30:28.208
and this'll work out if you plug

00:30:28.208 --> 00:30:29.943
it in to the sigmoid function.

00:30:29.943 --> 00:30:33.012
And so now if we want to
do, if we want to take

00:30:33.012 --> 00:30:36.315
the gradient, and we want
to treat this entire sigmoid

00:30:36.315 --> 00:30:39.525
as one node, now what we should do

00:30:39.525 --> 00:30:41.219
is we need to use this local gradient

00:30:41.219 --> 00:30:42.833
that we've derived up here, right?

00:30:42.833 --> 00:30:45.962
One minus sigmoid of x
times the sigmoid of x.

00:30:45.962 --> 00:30:48.894
So if we plug this in, and here we know

00:30:48.894 --> 00:30:51.612
that the value of sigmoid of x was 0.73,

00:30:51.612 --> 00:30:54.267
so if we plug this value
in we'll see that this,

00:30:54.267 --> 00:30:57.543
the value of this gradient
is equal to 0.2, right,

00:30:57.543 --> 00:31:00.206
and so the value of this
local gradient is 0.2,

00:31:00.206 --> 00:31:03.466
we multiply it by the
upstream gradient, which is one,

00:31:03.466 --> 00:31:07.044
and we're going to get
out exactly the same value

00:31:07.044 --> 00:31:09.267
of the gradient with respect
to before the sigmoid gate,

00:31:09.267 --> 00:31:13.434
as if we broke it down into all
of the smaller computations.
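
NOTE A quick numerical check of the sigmoid-gate shortcut above:
    import math
    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))
    x = 1.0                            # the green input to the sigmoid gate
    s = sigmoid(x)                     # ~0.73, the forward output
    local_grad = (1.0 - s) * s         # d(sigma)/dx, written using the output
    upstream = 1.0
    print(round(s, 2), round(local_grad * upstream, 2))  # 0.73 0.2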

00:31:17.191 --> 00:31:19.325
Okay, and so as we're looking
at what's happening, right,

00:31:19.325 --> 00:31:23.714
as we're taking these
gradients going backwards

00:31:23.714 --> 00:31:25.168
through our computational graph,

00:31:25.168 --> 00:31:28.790
there's some patterns that you'll notice

00:31:28.790 --> 00:31:31.228
where there's some
intuitive interpretation

00:31:31.228 --> 00:31:33.058
that we can give these, right?

00:31:33.058 --> 00:31:37.214
So we saw that the add gate is
a gradient distributor right,

00:31:37.214 --> 00:31:41.129
when we passed through
this addition gate here,

00:31:41.129 --> 00:31:43.930
which had two branches coming out of it,

00:31:43.930 --> 00:31:46.103
it took the gradient,
the upstream gradient

00:31:46.103 --> 00:31:48.586
and it just distributed it,
passed the exact same thing

00:31:48.586 --> 00:31:51.947
to both of the branches
that were connected.

00:31:51.947 --> 00:31:55.156
So here's a couple more
that we can think about.

00:31:55.156 --> 00:31:57.739
So what's a max gate look like?

00:31:59.214 --> 00:32:01.310
So we have a max gate
here at the bottom, right,

00:32:01.310 --> 00:32:04.825
where the inputs coming in are z and w,

00:32:04.825 --> 00:32:08.665
z has a value of two, w has
a value of negative one,

00:32:08.665 --> 00:32:11.444
and then we took the max of
this, which is two, right,

00:32:11.444 --> 00:32:14.251
and so we pass this
down into the remainder

00:32:14.251 --> 00:32:16.877
of our computational graph.

00:32:16.877 --> 00:32:20.068
So now if we're taking the
gradients with respect to this,

00:32:20.068 --> 00:32:23.223
the upstream gradient is, let's
say two coming back, right,

00:32:23.223 --> 00:32:26.890
and what does this local
gradient look like?

00:32:30.057 --> 00:32:31.753
So anyone, yes?

00:32:31.753 --> 00:32:35.011
- [Student] It'll be zero for
one, and one for the other?

00:32:35.011 --> 00:32:35.862
- Right.

00:32:35.862 --> 00:32:39.435
[student speaking away from microphone]

00:32:39.435 --> 00:32:42.607
Exactly, so the answer that was given

00:32:42.607 --> 00:32:45.873
is that z will have a gradient of two,

00:32:45.873 --> 00:32:49.073
w will have a value, a gradient of zero,

00:32:49.073 --> 00:32:50.931
and so one of these is going to get

00:32:50.931 --> 00:32:53.112
the full value of the
gradient just passed back,

00:32:53.112 --> 00:32:57.140
and routed to that variable,
and then the other one

00:32:57.140 --> 00:33:00.223
will have a gradient of zero, and so,

00:33:01.328 --> 00:33:03.472
so we can think of this as kind
of a gradient router, right,

00:33:03.472 --> 00:33:05.949
so, whereas the addition node passed back

00:33:05.949 --> 00:33:08.574
the same gradient to
both branches coming in,

00:33:08.574 --> 00:33:10.154
the max gate will just take the gradient

00:33:10.154 --> 00:33:12.630
and route it to one of the branches,

00:33:12.630 --> 00:33:15.344
and this makes sense because
if we look at our forward pass,

00:33:15.344 --> 00:33:16.888
what's happening is that only the value

00:33:16.888 --> 00:33:19.393
that was the maximum got passed down to

00:33:19.393 --> 00:33:22.594
the rest of the
computational graph, right?

00:33:22.594 --> 00:33:25.022
So it's the only value
that actually affected

00:33:25.022 --> 00:33:27.647
our function computation at
the end, and so it makes sense

00:33:27.647 --> 00:33:29.443
that when we're passing
our gradients back,

00:33:29.443 --> 00:33:32.610
we just want to, you know,

00:33:33.544 --> 00:33:38.270
flow it through that
branch of the computation.

00:33:38.270 --> 00:33:40.705
Okay, and so another one,
what's a multiplication gate,

00:33:40.705 --> 00:33:44.872
which we saw earlier, is there
any interpretation of this?

00:33:46.028 --> 00:33:50.195
[student speaking away from microphone]

00:33:52.976 --> 00:33:55.445
Okay, so the answer that was given

00:33:55.445 --> 00:33:57.282
is that the local
gradient is basically just

00:33:57.282 --> 00:33:59.408
the value of the other variable.

00:33:59.408 --> 00:34:01.407
Yeah, so that's exactly right.

00:34:01.407 --> 00:34:04.712
So we can think of this as
a gradient switcher, right?

00:34:04.712 --> 00:34:06.409
A switcher, and I guess
a scaler, where we take

00:34:06.409 --> 00:34:08.802
the upstream gradient and we scale it by

00:34:08.802 --> 00:34:11.302
the value of the other branch.
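
NOTE The three gate intuitions just discussed (distributor, router, switcher/scaler), written as tiny backward rules; here dz is the upstream gradient flowing in from above:
    def add_backward(x, y, dz):
        return dz, dz                               # distributes the same gradient
    def max_backward(x, y, dz):
        return (dz, 0.0) if x >= y else (0.0, dz)   # routes it to the max input
    def mul_backward(x, y, dz):
        return dz * y, dz * x                       # scales it by the other input
    print(max_backward(2.0, -1.0, 2.0))             # (2.0, 0.0), as in the slide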

00:34:15.719 --> 00:34:17.559
Okay, and so one other thing to note is

00:34:17.559 --> 00:34:20.205
that when we have a place where one node

00:34:20.205 --> 00:34:22.592
is connected to multiple nodes,

00:34:22.592 --> 00:34:26.187
the gradients add up at this node, right?

00:34:26.187 --> 00:34:29.804
So at these branches, using
the multivariate chain rule,

00:34:29.804 --> 00:34:31.520
we're just going to take the value of

00:34:31.520 --> 00:34:35.274
the upstream gradient coming
back from each of these nodes,

00:34:35.274 --> 00:34:37.364
and we'll add these
together to get the total

00:34:37.364 --> 00:34:40.499
upstream gradient that's
flowing back into this node,

00:34:40.500 --> 00:34:43.005
and you can see this from
the multivariate chain rule

00:34:43.005 --> 00:34:47.046
and also thinking about this,
you can think about this

00:34:47.047 --> 00:34:49.847
that if you're going to
change this node a little bit,

00:34:49.847 --> 00:34:52.781
it's going to affect both
of these connected nodes

00:34:52.781 --> 00:34:54.949
in the forward pass,
right, when you're making

00:34:54.949 --> 00:34:57.263
your forward pass through the graph.

00:34:57.263 --> 00:34:59.478
And so then when you're
doing backprop, right,

00:34:59.478 --> 00:35:03.803
then now the, both of
these gradients coming back

00:35:03.803 --> 00:35:06.015
are going to affect this node, right,

00:35:06.015 --> 00:35:07.872
and so that's how we're
going to sum these up to be

00:35:07.872 --> 00:35:12.725
the total upstream gradient
flowing back into this node.

00:35:12.725 --> 00:35:15.892
Okay, so any questions about backprop,

00:35:18.439 --> 00:35:22.333
going through these forward
and backward passes?

00:35:22.333 --> 00:35:23.429
- [Student] So we haven't done anything

00:35:23.429 --> 00:35:25.490
to actually update the weights.

00:35:25.490 --> 00:35:29.072
[speaking away from microphone]

00:35:29.072 --> 00:35:31.418
- Right, so the question is,
we haven't done anything yet

00:35:31.418 --> 00:35:33.284
to update the values of these weights,

00:35:33.284 --> 00:35:35.994
we've only found the
gradients with respect

00:35:35.994 --> 00:35:37.867
to the variables, that's exactly right.

00:35:37.867 --> 00:35:41.206
So what we've talked about
so far in this lecture is how

00:35:41.206 --> 00:35:45.070
to compute gradients with
respect to any variables

00:35:45.070 --> 00:35:48.210
in our function, right,
and then once we have these

00:35:48.210 --> 00:35:51.258
we can just apply everything we learned in

00:35:51.258 --> 00:35:54.419
the optimization lecture,
last lecture, right?

00:35:54.419 --> 00:35:57.363
So given the gradient,
we now take a step in

00:35:57.363 --> 00:35:59.291
the direction of the negative gradient in order

00:35:59.291 --> 00:36:02.391
to update our weight,
our parameters, right?

00:36:02.391 --> 00:36:04.331
So you can just take this entire framework

00:36:04.331 --> 00:36:07.049
that we learned about last
lecture for optimization,

00:36:07.049 --> 00:36:11.196
and what we've done here is
just learn how to compute

00:36:11.196 --> 00:36:14.747
the gradients we need for
arbitrarily complex functions,

00:36:14.747 --> 00:36:16.920
right, and so this is going
to be useful when we talk

00:36:16.920 --> 00:36:20.039
about complex functions like
neural networks later on.

00:36:20.039 --> 00:36:21.083
Yeah?

00:36:21.083 --> 00:36:22.667
- [Student] Do you mind writing out the,

00:36:22.667 --> 00:36:24.453
all the variables, so you could help explain

00:36:24.453 --> 00:36:27.152
this slide a little better?

00:36:27.152 --> 00:36:31.069
- Yeah, so I can write
this maybe on the board.

00:36:32.851 --> 00:36:37.430
Right, so basically if we're
going to have, let's see,

00:36:37.430 --> 00:36:39.544
if we're going to have the gradient of f

00:36:39.544 --> 00:36:41.904
with respect to some variable x, right,

00:36:41.904 --> 00:36:45.821
and let's say it's
connected through intermediate variables,

00:36:49.203 --> 00:36:51.953
let's see, the q-i's, we can basically write this out...
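
For reference, the expression being written on the board here is presumably the multivariate chain rule, something like:

\frac{\partial f}{\partial x} = \sum_i \frac{\partial f}{\partial q_i} \, \frac{\partial q_i}{\partial x}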

00:37:01.680 --> 00:37:03.311
Right, so this is basically saying

00:37:03.311 --> 00:37:07.436
that if x is connected to
these multiple elements, right,

00:37:07.436 --> 00:37:10.159
which in this case, different q-is,

00:37:10.159 --> 00:37:12.992
then the chain rule,

00:37:14.228 --> 00:37:16.453
it's going to take the effect of each of

00:37:16.453 --> 00:37:18.494
these intermediate variables, right,

00:37:18.494 --> 00:37:22.897
on our final output f, and
then compound each one with

00:37:22.897 --> 00:37:26.447
the local effect of our variable x on

00:37:26.447 --> 00:37:29.127
that intermediate value, right?

00:37:29.127 --> 00:37:33.294
So yeah, it's basically just
summing all these up together.

00:37:35.755 --> 00:37:39.525
Okay, so now that we've, you
know, done all these examples

00:37:39.525 --> 00:37:42.203
in the scalar case, we're going to look at

00:37:42.203 --> 00:37:44.720
what happens when we have vectors, right?

00:37:44.720 --> 00:37:47.054
So now our variables x, y and z,

00:37:47.054 --> 00:37:50.250
instead of just being numbers,
are going to be vectors.

00:37:50.250 --> 00:37:53.877
And so everything stays exactly
the same, the entire flow,

00:37:53.877 --> 00:37:56.236
the only difference is
that now our gradients

00:37:56.236 --> 00:37:59.428
are going to be Jacobian matrices, right,

00:37:59.428 --> 00:38:04.014
so these are now going
to be matrices containing

00:38:04.014 --> 00:38:07.741
the derivative of each
element of, for example z

00:38:07.741 --> 00:38:10.574
with respect to each element of x.

00:38:14.237 --> 00:38:18.355
Okay, and so to, you
know, give an example

00:38:18.355 --> 00:38:20.821
of something where this is
happening, right, let's say

00:38:20.821 --> 00:38:24.049
that our input is
going to now be a vector,

00:38:24.049 --> 00:38:27.621
so let's say we have a
4096-dimensional input vector,

00:38:27.621 --> 00:38:29.797
and this is kind of a common
size that you might see

00:38:29.797 --> 00:38:33.842
in convolutional neural networks later on,

00:38:33.842 --> 00:38:37.918
and our node is going to be an
element-wise maximum, right?

00:38:37.918 --> 00:38:41.128
So we have f of x is equal to the maximum

00:38:41.128 --> 00:38:45.295
of x compared with zero
element-wise, and then our output

00:38:46.875 --> 00:38:50.708
is going to be also a
4096-dimensional vector.

00:38:53.625 --> 00:38:55.351
Okay, so in this case, what's the size

00:38:55.351 --> 00:38:56.969
of our Jacobian matrix?

00:38:56.969 --> 00:38:59.281
Remember I said earlier,
the Jacobian matrix

00:38:59.281 --> 00:39:01.903
is going to be a matrix

00:39:01.903 --> 00:39:03.628
of partial derivatives, where each row has

00:39:03.628 --> 00:39:06.229
the partial derivatives
of one dimension of

00:39:06.229 --> 00:39:10.396
the output with respect to
each dimension of the input.

00:39:11.705 --> 00:39:15.230
Okay, so the answer I
heard was 4,096 squared,

00:39:15.230 --> 00:39:17.572
and that's, yeah, that's correct.

00:39:17.572 --> 00:39:21.405
So this is pretty large,
right, 4,096 by 4,096

00:39:22.261 --> 00:39:25.203
and in practice this is
going to be even larger

00:39:25.203 --> 00:39:27.400
because we're going to
work with many batches

00:39:27.400 --> 00:39:30.692
of, you know, for example, 100 inputs

00:39:30.692 --> 00:39:33.187
at the same time, right,
and we'll put all of these

00:39:33.187 --> 00:39:36.205
through our node at the same
time to be more efficient,

00:39:36.205 --> 00:39:38.739
and so this is going to scale this by 100,

00:39:38.739 --> 00:39:40.595
and in practice our Jacobian's
actually going to turn out

00:39:40.595 --> 00:39:45.082
to be something like
409,600 by 409,600, right,

00:39:45.082 --> 00:39:48.499
so this is really huge, and basically

00:39:48.499 --> 00:39:50.750
completely impractical to work with.

00:39:50.750 --> 00:39:53.633
So in practice though,
we don't actually need

00:39:53.633 --> 00:39:57.507
to compute this huge
Jacobian most of the time,

00:39:57.507 --> 00:39:59.149
and so why is that, like, what does

00:39:59.149 --> 00:40:00.902
this Jacobian matrix look like?

00:40:00.902 --> 00:40:04.509
If we think about what's happening here,

00:40:04.509 --> 00:40:06.777
where we're taking this
element-wise maximum,

00:40:06.777 --> 00:40:08.312
and we think about what are each of

00:40:08.312 --> 00:40:10.899
the partial derivatives, right,

00:40:10.899 --> 00:40:13.888
which dimension of the inputs affect

00:40:13.888 --> 00:40:15.706
which dimensions of the output?

00:40:15.706 --> 00:40:19.686
What sort of structure can we
see in our Jacobian matrix?

00:40:19.686 --> 00:40:21.198
[student speaking away from microphone]

00:40:21.198 --> 00:40:23.869
Okay, so I heard that it's
diagonal, right, exactly.

00:40:23.869 --> 00:40:27.703
So because this is element-wise,
right, each element of

00:40:27.703 --> 00:40:31.109
the input, say the first
dimension, only affects

00:40:31.109 --> 00:40:34.741
that corresponding element
in the output, right?

00:40:34.741 --> 00:40:38.014
And so because of that,
our Jacobian matrix

00:40:38.014 --> 00:40:41.567
is just going to
be a diagonal matrix.

00:40:41.567 --> 00:40:43.924
And so in practice then,
we don't actually have

00:40:43.924 --> 00:40:47.198
to write out and formulate
this entire Jacobian,

00:40:47.198 --> 00:40:51.365
we can just know the effect
of x on the output, right,

00:40:54.986 --> 00:40:57.883
and then we can just
use these values, right,

00:40:57.883 --> 00:41:01.800
and fill it in as we're
computing the gradient.
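
A rough NumPy sketch of what this looks like for the element-wise maximum node, with variable names assumed for illustration; note that the full 4096-by-4096 Jacobian is never formed:

import numpy as np
x = np.random.randn(4096)     # input vector
out = np.maximum(0, x)        # forward pass: element-wise max with zero
dout = np.random.randn(4096)  # upstream gradient flowing back
dx = dout * (x > 0)           # backward pass: the diagonal Jacobian applied implicitly
# each upstream entry passes through where x was positive, and is zeroed out otherwise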

00:41:04.100 --> 00:41:06.716
Okay, so now we're going to go through

00:41:06.716 --> 00:41:10.693
a more concrete vectorized
example of a computational graph.

00:41:10.693 --> 00:41:11.724
Right, so let's look at a case

00:41:11.724 --> 00:41:15.065
where we have the function f of x and W

00:41:15.065 --> 00:41:19.232
is equal to, basically the
squared L-two norm of W multiplied by x,

00:41:22.407 --> 00:41:24.013
and so in this case we're going to say x

00:41:24.013 --> 00:41:26.763
is n-dimensional and W is n by n.

00:41:29.076 --> 00:41:30.563
Right, so again our first step,

00:41:30.563 --> 00:41:32.635
writing out the
computational graph, right?

00:41:32.635 --> 00:41:35.303
We have W multiplied by
x, and then followed by,

00:41:35.303 --> 00:41:37.886
I'm just going to call this L-two.

00:41:39.520 --> 00:41:42.822
And so now let's also fill
out some values for this,

00:41:42.822 --> 00:41:45.382
so we can see that, you
know, let's say we have W be

00:41:45.382 --> 00:41:47.560
this two by two matrix, and x

00:41:47.560 --> 00:41:50.632
is going to be this
two-dimensional vector, right?

00:41:50.632 --> 00:41:53.949
And so we can say, label
again our intermediate nodes.

00:41:53.949 --> 00:41:56.653
So our intermediate node
after the multiplication

00:41:56.653 --> 00:42:00.230
it's going to be q, we
have q equals W times x,

00:42:00.230 --> 00:42:02.666
which we can write out
element-wise this way,

00:42:02.666 --> 00:42:05.632
where the first element is
just W-one-one times x-one

00:42:05.632 --> 00:42:08.904
plus W-one-two times x-two and so on,

00:42:08.904 --> 00:42:12.611
and then we can now express
f in relation to q, right?

00:42:12.611 --> 00:42:15.508
So looking at the second
node we have f of q

00:42:15.508 --> 00:42:18.175
is equal to the squared L-two norm of q,

00:42:19.260 --> 00:42:24.218
which is equal to q-one
squared plus q-two squared.

00:42:24.218 --> 00:42:25.923
Okay, so we filled this in, right,

00:42:25.923 --> 00:42:29.423
we get q and then we get our final output.

00:42:30.491 --> 00:42:33.218
Okay, so now let's do
backprop through this, right?

00:42:33.218 --> 00:42:36.333
So again, this is always the first step,

00:42:36.333 --> 00:42:40.500
we have the gradient with respect
to our output is just one.

00:42:44.084 --> 00:42:47.505
Okay, so now let's move back one node,

00:42:47.505 --> 00:42:50.902
so now we want to find the
gradient with respect to q,

00:42:50.902 --> 00:42:55.493
right, our intermediate
variable before the L-two.

00:42:55.493 --> 00:42:58.719
And so q is a two-dimensional vector,

00:42:58.719 --> 00:43:00.065
and what we want to do is we want

00:43:00.065 --> 00:43:04.232
to find how each element of q
affects our final value of f,

00:43:07.918 --> 00:43:09.586
right, and so if we
look at this expression

00:43:09.586 --> 00:43:11.955
that we've written out
for f here at the bottom,

00:43:11.955 --> 00:43:14.705
we can see that the gradient of f

00:43:15.644 --> 00:43:19.293
with respect to a specific
q-i, let's say q-one,

00:43:19.293 --> 00:43:23.136
is just going to be two times q-i, right?

00:43:23.136 --> 00:43:26.569
This is just taking this derivative here,

00:43:26.569 --> 00:43:30.293
and so we have this expression

00:43:30.293 --> 00:43:32.453
with respect to each element of q-i,

00:43:32.453 --> 00:43:34.606
we could also, you know, write this out

00:43:34.606 --> 00:43:36.944
in vector form if we want to,

00:43:36.944 --> 00:43:39.869
it's just going to be two
times our vector of q, right,

00:43:39.869 --> 00:43:41.719
if we want to write
this out in vector form,

00:43:41.719 --> 00:43:45.698
and so what we get is
that our gradient is 0.44,

00:43:45.698 --> 00:43:48.226
and 0.52, this vector, right?

00:43:48.226 --> 00:43:50.656
And so you can see that it just took q

00:43:50.656 --> 00:43:52.004
and it scaled it by two, right?

00:43:52.004 --> 00:43:55.192
Each element is just multiplied by two.

00:43:55.192 --> 00:43:58.772
So the gradient of a vector
is always going to be

00:43:58.772 --> 00:44:01.182
the same size as the original vector,

00:44:01.182 --> 00:44:05.015
and each element of this
gradient is going to tell us

00:44:06.530 --> 00:44:09.132
how much of
this particular element

00:44:09.132 --> 00:44:12.549
affects our final output of the function.

00:44:16.126 --> 00:44:18.920
Okay, so now let's move
one step backwards, right,

00:44:18.920 --> 00:44:22.523
what's the gradient with respect to W?

00:44:22.523 --> 00:44:25.639
And so here again we want
to use the same concept

00:44:25.639 --> 00:44:26.998
of trying to apply the chain rule, right,

00:44:26.998 --> 00:44:29.129
so we want to compute our local gradient

00:44:29.129 --> 00:44:31.328
of q with respect to W, and so let's look

00:44:31.328 --> 00:44:34.683
at this again element-wise,
and if we do that,

00:44:34.683 --> 00:44:37.884
let's see what's the
effect of each q, right,

00:44:37.884 --> 00:44:41.636
each element of q with
respect to each element of W,

00:44:41.636 --> 00:44:43.106
and so this is going to be the Jacobian

00:44:43.106 --> 00:44:47.592
that we talked about earlier,
and if we look at this

00:44:47.592 --> 00:44:49.509
in this multiplication, q is equal

00:44:49.509 --> 00:44:53.092
to W times x, right,
what's the derivative,

00:44:56.236 --> 00:44:59.197
or the gradient of the first element of q,

00:44:59.197 --> 00:45:03.962
so our first element up top,
with respect to W-one-one?

00:45:03.962 --> 00:45:07.470
So q-one with respect to W-one-one?

00:45:07.470 --> 00:45:09.157
What's that value?

00:45:09.157 --> 00:45:10.851
X-one, exactly.

00:45:10.851 --> 00:45:14.068
Yeah, so we know that this is x-one,

00:45:14.068 --> 00:45:17.659
and we can write this
out more generally:

00:45:17.659 --> 00:45:21.826
the gradient of q-k with respect
to W-i,j is equal to x-j if k equals i, and zero otherwise.

00:45:24.313 --> 00:45:26.111
And then now if we want
to find the gradient

00:45:26.111 --> 00:45:30.278
of f with respect
to each W-i,j.

00:45:31.523 --> 00:45:35.332
So looking at these derivatives now,

00:45:35.332 --> 00:45:38.339
we can use this chain rule
that we talked about earlier

00:45:38.339 --> 00:45:41.672
where we basically compound df over dq-k

00:45:43.770 --> 00:45:47.270
for each element of q with dq-k over dW-i,j

00:45:48.934 --> 00:45:51.498
for each element of W-i,j, right?

00:45:51.498 --> 00:45:53.563
So we find the effect of each element of W

00:45:53.563 --> 00:45:57.649
on each element of q, and
sum this across all q.

00:45:57.649 --> 00:45:59.653
And so if you write this
out, this is going to give

00:45:59.653 --> 00:46:03.236
this expression of two
times q-i times x-j.

00:46:05.898 --> 00:46:08.744
Okay, and so filling this out then we get

00:46:08.744 --> 00:46:11.564
this gradient with respect to W,

00:46:11.564 --> 00:46:14.270
and so again we can compute
this each element-wise,

00:46:14.270 --> 00:46:17.360
or we can also look at this
expression that we've derived

00:46:17.360 --> 00:46:20.943
and write it out in
vectorized form, right?

00:46:22.028 --> 00:46:24.827
So okay, and remember, the important thing

00:46:24.827 --> 00:46:28.484
is always to check that the gradient
with respect to a variable

00:46:28.484 --> 00:46:31.137
should have the same shape as
the variable, and

00:46:31.137 --> 00:46:33.665
so this is something
really useful in practice

00:46:33.665 --> 00:46:36.108
to sanity check, right,
like once you've computed

00:46:36.108 --> 00:46:38.042
what your gradient should
be, check that this is

00:46:38.042 --> 00:46:40.709
the same shape as your variable,

00:46:42.710 --> 00:46:46.220
because again,
each element of your gradient

00:46:46.220 --> 00:46:49.176
is quantifying how much that element

00:46:49.176 --> 00:46:53.343
is affecting your final output.

00:46:54.177 --> 00:46:55.010
Yeah?

00:46:55.010 --> 00:46:57.489
[student speaking away from microphone]

00:46:57.489 --> 00:46:59.023
The bold one, oh, the bold-faced one

00:46:59.023 --> 00:47:01.427
is an indicator function,
so this is saying

00:47:01.427 --> 00:47:04.177
that it's just one if k equals i.

00:47:08.276 --> 00:47:11.571
Okay, so let's see, so we've done that,

00:47:11.571 --> 00:47:14.738
and so now let's see, one more example.

00:47:16.647 --> 00:47:18.094
Now our last thing we need to find is

00:47:18.094 --> 00:47:21.031
the gradient with respect to x.

00:47:21.031 --> 00:47:24.824
So here if we compute the
partial derivatives we can see

00:47:24.824 --> 00:47:28.574
that dq-k over dx-i is
equal to W-k,i, right,

00:47:29.535 --> 00:47:32.702
using the same way as we did it for W,

00:47:34.833 --> 00:47:38.702
and then again we can
just use the chain rule

00:47:38.702 --> 00:47:43.127
and get the total
expression for that, right?

00:47:43.127 --> 00:47:44.902
And so this is going to be the gradient

00:47:44.902 --> 00:47:48.350
with respect to x, again,
of the same shape as x,

00:47:48.350 --> 00:47:49.522
and we can also write this out

00:47:49.522 --> 00:47:52.022
in vectorized form if we want.

00:47:53.529 --> 00:47:56.442
Okay, so any questions about this, yeah?

00:47:56.442 --> 00:47:59.100
[student speaking away from microphone]

00:47:59.100 --> 00:48:01.850
So we are computing the Jacobian,

00:48:03.194 --> 00:48:05.694
so let me go back here, right,

00:48:07.414 --> 00:48:10.774
so if we're doing, so right, so we have

00:48:10.774 --> 00:48:12.873
these partial derivatives of q-k

00:48:12.873 --> 00:48:16.439
with respect to x-i, right, and these

00:48:16.439 --> 00:48:20.714
are forming the entries
of your Jacobian, right?

00:48:20.714 --> 00:48:22.879
And so in practice what we're going to do

00:48:22.879 --> 00:48:26.306
is we basically take that,
and you're going to see it

00:48:26.306 --> 00:48:29.222
up there in the chain rule,
so the vectorized expression

00:48:29.222 --> 00:48:31.259
of gradient with respect to x, right,

00:48:31.259 --> 00:48:33.293
this is going to have the Jacobian here

00:48:33.293 --> 00:48:36.332
which is this transposed value here,

00:48:36.332 --> 00:48:38.926
so you can write it
out in vectorized form.

00:48:38.926 --> 00:48:43.489
[student speaking away from microphone]

00:48:43.489 --> 00:48:44.923
So well, so in this case the matrix

00:48:44.923 --> 00:48:47.636
is going to be the same size as W right,

00:48:47.636 --> 00:48:51.803
so it's not actually a large
matrix in this case, right?

00:48:55.878 --> 00:48:58.941
Okay, so the way that we've
been thinking about this

00:48:58.941 --> 00:49:01.764
is like a really modularized
implementation, right,

00:49:01.764 --> 00:49:05.305
where in our computational graph, right,

00:49:05.305 --> 00:49:07.678
we look at each node
locally and we compute

00:49:07.678 --> 00:49:09.046
the local gradients and chain them

00:49:09.046 --> 00:49:10.965
with upstream gradients coming down,

00:49:10.965 --> 00:49:13.181
and so you can think of this as basically

00:49:13.181 --> 00:49:15.337
a forward and a backwards API, right?

00:49:15.337 --> 00:49:19.186
In the forward pass we
implement, you know,

00:49:19.186 --> 00:49:22.396
a function computing
the output of this node,

00:49:22.396 --> 00:49:25.011
and then in the backwards
pass we compute the gradient.

00:49:25.011 --> 00:49:27.295
And so when we actually
implement this in code,

00:49:27.295 --> 00:49:30.592
we're going to do this
in exactly the same way.

00:49:30.592 --> 00:49:34.865
So we can basically think
about, for each gate, right,

00:49:34.865 --> 00:49:38.464
if we implement a forward
function and a backward function,

00:49:38.464 --> 00:49:40.908
where the backward function
is computing the chain rule,

00:49:40.908 --> 00:49:45.572
then if we have our entire
graph, we can just make

00:49:45.572 --> 00:49:48.833
a forward pass through the
entire graph by iterating through

00:49:48.833 --> 00:49:51.059
all the nodes in the graph, all the gates.

00:49:51.059 --> 00:49:52.399
Here I'm going to use
the word gate and node,

00:49:52.399 --> 00:49:54.768
kind of interchangeably,
we can iterate through

00:49:54.768 --> 00:49:57.194
all of these gates and just call forward

00:49:57.194 --> 00:49:58.803
on each of the gates, right?

00:49:58.803 --> 00:50:01.349
And we just want to do this
in topologically sorted order,

00:50:01.349 --> 00:50:03.357
so we process all of
the inputs coming in to

00:50:03.357 --> 00:50:06.332
a node before we process that node.

00:50:06.332 --> 00:50:08.576
And then going backwards,
we're just going to then

00:50:08.576 --> 00:50:11.565
go through all of the gates
in this reverse sorted order,

00:50:11.565 --> 00:50:15.482
and then call backwards
on each of these gates.
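
A skeleton of the graph-level API being described, with all names assumed; concrete gate objects implementing forward() and backward(), like the MultiplyGate sketched a bit further down, are what would go in the list:

class ComputationalGraph:
    def __init__(self, sorted_gates):
        self.gates = sorted_gates          # assumed already topologically sorted
    def forward(self):
        for gate in self.gates:            # every input to a gate is ready before it runs
            gate.forward()                 # gates assumed to read inputs from their connections
        return self.gates[-1].output       # output of the last gate is the graph's output
    def backward(self):
        for gate in reversed(self.gates):  # reverse topological order for backprop
            gate.backward()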

00:50:16.564 --> 00:50:19.147
Okay, and so if we look at then

00:50:20.225 --> 00:50:22.285
the implementation for
our particular gates,

00:50:22.285 --> 00:50:24.596
so for example, this MultiplyGate here,

00:50:24.596 --> 00:50:26.960
we want to implement
the forward pass, right,

00:50:26.960 --> 00:50:31.127
so it gets x and y as inputs,
and returns the value of z,

00:50:32.520 --> 00:50:34.162
and then when we go backwards, right,

00:50:34.162 --> 00:50:37.216
we get as input dz, which
is our upstream gradient,

00:50:37.216 --> 00:50:40.167
and we want to output the gradients

00:50:40.167 --> 00:50:42.523
on the input's x and
y to pass down, right?

00:50:42.523 --> 00:50:45.190
So we're going to output dx and dy,

00:50:46.426 --> 00:50:49.110
and so in this case, in this example,

00:50:49.110 --> 00:50:51.567
everything is back to
the scalar case here,

00:50:51.567 --> 00:50:55.574
and so if we look at
this in the forward pass,

00:50:55.574 --> 00:50:57.215
one thing that's important
is that we need to

00:50:57.215 --> 00:50:59.327
cache the values
of the forward pass, right,

00:50:59.327 --> 00:51:01.043
because we end up using this in

00:51:01.043 --> 00:51:03.156
the backward pass a lot of the time.

00:51:03.156 --> 00:51:06.061
So here in the forward pass,
we want to cache the values

00:51:06.061 --> 00:51:10.594
of x and y, right, and
in the backward pass,

00:51:10.594 --> 00:51:12.930
using the chain rule,
we're going to, remember,

00:51:12.930 --> 00:51:14.348
take the value of the upstream gradient

00:51:14.348 --> 00:51:17.345
and scale it by the value
of the other branch, right,

00:51:17.345 --> 00:51:20.294
and so we'll keep, for
dx we'll take our value

00:51:20.294 --> 00:51:22.960
of self.y that we kept, and multiply it

00:51:22.960 --> 00:51:25.877
by dz coming down, and same for dy.
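
A minimal Python sketch of the multiply gate just described, using plain scalars (names assumed); the forward pass caches its inputs so the backward pass can apply the chain rule:

class MultiplyGate:
    def forward(self, x, y):
        self.x, self.y = x, y     # cache the inputs for the backward pass
        z = x * y
        return z
    def backward(self, dz):
        dx = self.y * dz          # scale the upstream gradient by the other branch
        dy = self.x * dz
        return dx, dy

# for example:
gate = MultiplyGate()
z = gate.forward(3.0, -4.0)       # z = -12.0
dx, dy = gate.backward(2.0)       # dx = -8.0, dy = 6.0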

00:51:28.799 --> 00:51:32.879
Okay, so if you look at a lot
of deep-learning frameworks

00:51:32.879 --> 00:51:35.358
and libraries you'll see
that they exactly follow

00:51:35.358 --> 00:51:37.873
this kind of modularization, right?

00:51:37.873 --> 00:51:42.040
So for example, Caffe is a
popular deep learning framework,

00:51:43.284 --> 00:51:46.047
and you'll see, if you go look
through the Caffe source code

00:51:46.047 --> 00:51:48.351
you'll get to some
directory that says layers,

00:51:48.351 --> 00:51:52.443
and in layers, which are
basically computational nodes,

00:51:52.443 --> 00:51:54.400
usually layers might be, you know,

00:51:54.400 --> 00:51:56.217
slightly more complex
computational nodes

00:51:56.217 --> 00:51:58.671
like the sigmoid that
we talked about earlier,

00:51:58.671 --> 00:52:01.555
you'll see, basically just a whole list

00:52:01.555 --> 00:52:04.342
of all different kinds of
computational nodes, right?

00:52:04.342 --> 00:52:06.132
So you might have the sigmoid, and I know

00:52:06.132 --> 00:52:09.205
there might be, like,
a convolution layer here,

00:52:09.205 --> 00:52:11.925
an Argmax layer is another one,

00:52:11.925 --> 00:52:14.055
you'll have all of these
layers and if you dig in

00:52:14.055 --> 00:52:16.340
to each of them, they're
just exactly implementing

00:52:16.340 --> 00:52:18.355
a forward pass and a backward pass,

00:52:18.355 --> 00:52:20.929
and then all of these are called

00:52:20.929 --> 00:52:23.072
when we do forward and backward pass

00:52:23.072 --> 00:52:25.130
through the entire network that we formed,

00:52:25.130 --> 00:52:27.372
and so our network is just basically going

00:52:27.372 --> 00:52:30.393
to be stacking up all of these,

00:52:30.393 --> 00:52:34.467
the different layers that we
choose to use in the network.

00:52:34.467 --> 00:52:37.560
So for example, if we
look at a specific one,

00:52:37.560 --> 00:52:40.613
in this case a sigmoid layer, you'll see

00:52:40.613 --> 00:52:42.281
that in the sigmoid layer, right,

00:52:42.281 --> 00:52:44.658
we've talked about the sigmoid function,

00:52:44.658 --> 00:52:47.140
you'll see that there's a forward pass

00:52:47.140 --> 00:52:51.443
which basically computes
exactly the sigmoid expression,

00:52:51.443 --> 00:52:53.084
and then a backward pass, right,

00:52:53.084 --> 00:52:57.251
where it is taking as input
something, basically a top_diff,

00:53:00.031 --> 00:53:02.158
which is our upstream
gradient in this case,

00:53:02.158 --> 00:53:06.325
and multiplying it by a local
gradient that we compute.
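
Caffe's actual SigmoidLayer is written in C++, but the same idea can be sketched in NumPy, with names assumed for illustration:

import numpy as np

class SigmoidLayer:
    def forward(self, bottom):
        self.out = 1.0 / (1.0 + np.exp(-bottom))   # sigmoid, cached for the backward pass
        return self.out
    def backward(self, top_diff):
        local_grad = self.out * (1.0 - self.out)    # d(sigmoid)/dx = sigma(x) * (1 - sigma(x))
        return top_diff * local_grad                # chain rule with the upstream gradient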

00:53:08.267 --> 00:53:10.539
So in assignment one you'll get practice

00:53:10.539 --> 00:53:15.277
with this kind of, this
computational graph way of thinking

00:53:15.277 --> 00:53:16.829
where, you know, you're
going to be writing

00:53:16.829 --> 00:53:19.279
your SVM and Softmax classes,

00:53:19.279 --> 00:53:21.041
and taking the gradients of these.

00:53:21.041 --> 00:53:24.005
And so again, remember, always
as your first step,

00:53:24.005 --> 00:53:27.860
represent it as a
computational graph, right?

00:53:27.860 --> 00:53:29.492
Figure out what are all the computations

00:53:29.492 --> 00:53:31.632
that you did leading up to the output,

00:53:31.632 --> 00:53:32.928
and then when it's time

00:53:32.928 --> 00:53:35.786
to do your backward pass,
just take the gradient

00:53:35.786 --> 00:53:38.430
with respect to each of
these intermediate variables

00:53:38.430 --> 00:53:42.262
that you've defined in
your computational graph,

00:53:42.262 --> 00:53:46.345
and use the chain rule to
link them all together.

00:53:47.630 --> 00:53:50.726
Okay, so summary of what
we've talked about so far.

00:53:50.726 --> 00:53:53.037
When we get down to, you know,
working with neural networks,

00:53:53.037 --> 00:53:55.112
these are going to be
really large and complex,

00:53:55.112 --> 00:53:56.859
so it's going to be
impractical to write down

00:53:56.859 --> 00:54:01.175
the gradient formula by hand
for all your parameters.

00:54:01.175 --> 00:54:04.864
So in order to get these gradients, right,

00:54:04.864 --> 00:54:08.866
we talked about how, what we
should use is backpropagation,

00:54:08.866 --> 00:54:12.226
right, and this is kind of
one of the core techniques

00:54:12.226 --> 00:54:15.809
of, you know, neural
networks, is basically

00:54:16.660 --> 00:54:19.328
using backpropagation to
get your gradients, right?

00:54:19.328 --> 00:54:21.032
And so this is a recursive application of

00:54:21.032 --> 00:54:23.942
the chain rule where we have
this computational graph,

00:54:23.942 --> 00:54:26.298
and we start at the back and
we go backwards through it

00:54:26.298 --> 00:54:27.616
to compute the gradients with respect

00:54:27.616 --> 00:54:29.984
to all of the intermediate variables,

00:54:29.984 --> 00:54:31.806
which are your inputs, your parameters,

00:54:31.806 --> 00:54:35.369
and everything else in the middle.

00:54:35.369 --> 00:54:37.344
And we've also talked about how really

00:54:37.344 --> 00:54:40.642
this implementation and
this graph structure,

00:54:40.642 --> 00:54:42.746
you can see
each of these nodes

00:54:42.746 --> 00:54:45.671
as implementing a forward
and backwards API, right?

00:54:45.671 --> 00:54:47.976
And so in the forward
pass we want to compute

00:54:47.976 --> 00:54:49.881
the results of the operation, and we want

00:54:49.881 --> 00:54:51.937
to save any intermediate values

00:54:51.937 --> 00:54:55.149
that we might want to use later
in our gradient computation,

00:54:55.149 --> 00:54:58.676
and then in the backwards
pass we apply this chain rule

00:54:58.676 --> 00:55:00.200
and we take this upstream gradient,

00:55:00.200 --> 00:55:03.101
we chain it, multiply it
with our local gradient

00:55:03.101 --> 00:55:05.205
to compute the gradient with respect to

00:55:05.205 --> 00:55:08.240
the inputs of the node,
and we pass this down

00:55:08.240 --> 00:55:11.323
to the nodes that are connected next.

00:55:12.906 --> 00:55:15.263
Okay, so now finally we're going

00:55:15.263 --> 00:55:17.600
to talk about neural networks.

00:55:17.600 --> 00:55:21.600
All right, so really, you
know, neural networks,

00:55:24.254 --> 00:55:26.429
people draw a lot of analogies
between neural networks

00:55:26.429 --> 00:55:27.794
and the brain, and different types

00:55:27.794 --> 00:55:30.225
of biological inspirations,
and we'll get to that in

00:55:30.225 --> 00:55:33.266
a little bit, but first let's
talk about it, you know,

00:55:33.266 --> 00:55:34.670
just looking at it as a function,

00:55:34.670 --> 00:55:39.385
as a class of functions
without all of the brain stuff.

00:55:39.385 --> 00:55:41.227
So, so far we've talked about, you know,

00:55:41.227 --> 00:55:43.748
we've worked a lot with this
linear score function, right?

00:55:43.748 --> 00:55:46.573
f equals W times x, and
so we've been using this

00:55:46.573 --> 00:55:50.740
as a running example of a
function that we want to optimize.

00:55:51.716 --> 00:55:55.195
So instead of using the
single linear transformation,

00:55:55.195 --> 00:55:57.138
if we want a neural network, we

00:55:57.138 --> 00:55:59.066
can, in the simplest form,

00:55:59.066 --> 00:56:01.250
just stack two of these together, right?

00:56:01.250 --> 00:56:04.630
Just a linear transformation
on top of another one

00:56:04.630 --> 00:56:08.122
in order to get a two-layer
neural network, right?

00:56:08.122 --> 00:56:11.451
And so what this looks like is
first we have our, you know,

00:56:11.451 --> 00:56:14.284
a matrix multiply of W-one with x,

00:56:15.921 --> 00:56:19.342
and then we get this intermediate variable

00:56:19.342 --> 00:56:23.509
and we have this non-linear
function, a max of zero

00:56:24.761 --> 00:56:28.706
with the
output of this linear layer,

00:56:28.706 --> 00:56:32.093
and it's really important to
have these non-linearities

00:56:32.093 --> 00:56:34.050
in place, which we'll
talk about more later,

00:56:34.050 --> 00:56:36.480
because otherwise if you just
stack linear layers on top

00:56:36.480 --> 00:56:38.203
of each other, they're
just going to collapse

00:56:38.203 --> 00:56:40.976
to, like a single linear function.

00:56:40.976 --> 00:56:43.145
Okay, so we have our first linear layer

00:56:43.145 --> 00:56:45.846
and then we have this
non-linearity, right,

00:56:45.846 --> 00:56:49.347
and then on top of this we'll
add another linear layer.

00:56:49.347 --> 00:56:52.310
And then from here, finally
we can get our score function,

00:56:52.310 --> 00:56:54.962
our output vector of scores.
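
A minimal NumPy sketch of this two-layer score function; the sizes here (a 3072-dimensional input, 100 hidden units, 10 class scores) are just assumed example dimensions:

import numpy as np
x = np.random.randn(3072)
W1 = np.random.randn(100, 3072)
W2 = np.random.randn(10, 100)
h = np.maximum(0, W1.dot(x))   # first linear layer, then the max(0, .) non-linearity
s = W2.dot(h)                  # second linear layer gives the vector of class scores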

00:56:54.962 --> 00:56:57.562
So basically, like, more broadly speaking,

00:56:57.562 --> 00:57:00.189
neural networks are a class of functions

00:57:00.189 --> 00:57:02.652
where we have simpler functions, right,

00:57:02.652 --> 00:57:04.278
that are stacked on top of each other,

00:57:04.278 --> 00:57:06.261
and we stack them in a hierarchical way

00:57:06.261 --> 00:57:10.078
in order to make up a more
complex non-linear function,

00:57:10.078 --> 00:57:13.827
and so this is the idea of
having, basically multiple stages

00:57:13.827 --> 00:57:16.702
of hierarchical computation, right?

00:57:16.702 --> 00:57:19.455
And so, you know, this is kind of

00:57:19.455 --> 00:57:22.589
the main way that we do this: by taking

00:57:22.589 --> 00:57:26.220
something like this matrix
multiply, this linear layer,

00:57:26.220 --> 00:57:30.204
and we just stack multiple
of these on top of each other

00:57:30.204 --> 00:57:33.871
with non-linear functions
in-between, right?

00:57:38.446 --> 00:57:41.185
And so one thing that this
can help solve is if we look,

00:57:41.185 --> 00:57:43.135
if we remember back to
this linear score function

00:57:43.135 --> 00:57:45.170
that we were talking about, right,

00:57:45.170 --> 00:57:49.390
remember we discussed earlier how each row

00:57:49.390 --> 00:57:53.283
of our weight matrix W was
something like a template.

00:57:53.283 --> 00:57:55.537
It was a template that sort
of expressed, you know,

00:57:55.537 --> 00:57:58.697
what we're looking for in the input

00:57:58.697 --> 00:58:01.049
for a specific class, right,
so for example, you know,

00:58:01.049 --> 00:58:02.394
the car template looks something like

00:58:02.394 --> 00:58:06.110
this kind of fuzzy red car,
and we were looking for this

00:58:06.110 --> 00:58:10.151
in the input to compute the
score for the car class.

00:58:10.151 --> 00:58:11.588
And we talked about one
of the problems with this

00:58:11.588 --> 00:58:13.566
is that there's only one template, right?

00:58:13.566 --> 00:58:15.848
There's this red car, whereas in practice,

00:58:15.848 --> 00:58:17.129
we actually have multiple modes, right?

00:58:17.129 --> 00:58:19.537
We might want, we're looking
for, you know, a red car,

00:58:19.537 --> 00:58:21.225
there's also a yellow
car, like all of these

00:58:21.225 --> 00:58:24.293
are different kinds of cars, and so what

00:58:24.293 --> 00:58:28.210
this kind of multiple
layer network lets you do

00:58:29.287 --> 00:58:33.666
is now, you know, with
this intermediate variable h,

00:58:33.666 --> 00:58:36.785
right, W-one can still be
these kinds of templates,

00:58:36.785 --> 00:58:38.943
but now you have all of these scores

00:58:38.943 --> 00:58:42.006
for these templates in h,
and we can have another layer

00:58:42.006 --> 00:58:45.503
on top that's combining
these together, right?

00:58:45.503 --> 00:58:48.693
So we can say that actually
my car class should be,

00:58:48.693 --> 00:58:50.525
you know, connected to, we're looking

00:58:50.525 --> 00:58:53.602
for both red cars as well
as yellow cars, right,

00:58:53.602 --> 00:58:56.652
because we have this matrix W-two

00:58:56.652 --> 00:59:00.819
which is now a weighting
of all of the entries of our vector h.

00:59:05.691 --> 00:59:08.478
Okay, any questions about this?

00:59:08.478 --> 00:59:09.311
Yeah?

00:59:09.311 --> 00:59:13.795
[student speaking away from microphone]

00:59:13.795 --> 00:59:15.529
Yeah, so there's a lot of ways,

00:59:15.529 --> 00:59:17.189
so there's a lot of different
non-linear functions

00:59:17.189 --> 00:59:20.821
that you can choose from,
and we'll talk later on

00:59:20.821 --> 00:59:22.537
in a later lecture about
all the different kinds

00:59:22.537 --> 00:59:25.880
of non-linearities that
you might want to use.

00:59:25.880 --> 00:59:28.599
- [Student] For the pictures in the slide,

00:59:28.599 --> 00:59:31.411
so, on the bottom row you have images

00:59:31.411 --> 00:59:34.828
of your weight vector W-one, and so maybe

00:59:38.703 --> 00:59:42.536
you would have images
of another vector W-two?

00:59:46.014 --> 00:59:49.534
- So W-one, because it's
directly connected to

00:59:49.534 --> 00:59:52.965
the input x, this is what's
like, really interpretable,

00:59:52.965 --> 00:59:56.909
because you can formulate
all of these templates.

00:59:56.909 --> 00:59:59.492
W-two, so h is going to be a score

01:00:00.389 --> 01:00:03.570
of how much of each template
you saw, for example,

01:00:03.570 --> 01:00:06.047
all right, so it might be
like you have a, you know,

01:00:06.047 --> 01:00:09.640
like a, I don't know, two for the red car,

01:00:09.640 --> 01:00:11.865
and like, one for the yellow
car or something like that.

01:00:11.865 --> 01:00:16.678
- [Student] Oh, okay, so
instead of W-one being just 10,

01:00:16.678 --> 01:00:19.327
like, you would have a left-facing horse

01:00:19.327 --> 01:00:21.807
and a right-facing horse,
and they'd both be included--

01:00:21.807 --> 01:00:25.864
- Exactly, so the question
is basically whether

01:00:25.864 --> 01:00:27.849
in W-one you could have
both left-facing horse

01:00:27.849 --> 01:00:30.532
and right-facing horse,
right, and so yeah, exactly.

01:00:30.532 --> 01:00:34.590
So now W-one can be many different
kinds of templates, right?

01:00:34.590 --> 01:00:38.505
And then W-two,
now basically

01:00:38.505 --> 01:00:41.304
it's a weighted sum of
all of these templates.

01:00:41.304 --> 01:00:43.461
So now it allows you to weight
together multiple templates

01:00:43.461 --> 01:00:47.014
in order to get the final
score for a particular class.

01:00:47.014 --> 01:00:48.416
- [Student] So if you're
processing an image

01:00:48.416 --> 01:00:50.713
that's actually a left-facing horse.

01:00:50.713 --> 01:00:53.522
It'll get a really high score with

01:00:53.522 --> 01:00:55.297
the left-facing horse template,

01:00:55.297 --> 01:00:58.186
and a lower score with the
right-facing horse template,

01:00:58.186 --> 01:01:02.552
and then this will take
the maximum of the two?

01:01:02.552 --> 01:01:04.272
- Right, so okay, so the question is,

01:01:04.272 --> 01:01:08.687
if our image x is like a left-facing horse

01:01:08.687 --> 01:01:10.519
and in W-one we have a template of

01:01:10.519 --> 01:01:13.140
a left-facing horse and
a right-facing horse,

01:01:13.140 --> 01:01:14.435
then what's happening, right?

01:01:14.435 --> 01:01:17.390
So what happens is yeah,
so in h you might have

01:01:17.390 --> 01:01:21.199
a really high score for
your left-facing horse,

01:01:21.199 --> 01:01:24.627
kind of a lower score for
your right-facing horse,

01:01:24.627 --> 01:01:27.976
and W-two is, it's a weighted
sum, so it's not a maximum.

01:01:27.976 --> 01:01:30.749
It's a weighted sum of these templates,

01:01:30.749 --> 01:01:33.215
but if you have either a really high score

01:01:33.215 --> 01:01:35.597
for one of these templates,
or let's say you have, kind of

01:01:35.597 --> 01:01:38.413
a lower and medium score
for both of these templates,

01:01:38.413 --> 01:01:39.968
all of these kinds of combinations

01:01:39.968 --> 01:01:41.971
are going to give high scores, right?

01:01:41.971 --> 01:01:43.305
And so in the end what you're going to get

01:01:43.305 --> 01:01:45.153
is something that generally scores high

01:01:45.153 --> 01:01:47.986
when you have a horse of any kind.

01:01:49.195 --> 01:01:50.966
So let's say you had a front-facing horse,

01:01:50.966 --> 01:01:53.470
you might have medium values for both

01:01:53.470 --> 01:01:56.901
the left and the right templates.

01:01:56.901 --> 01:01:57.963
Yeah, question?

01:01:57.963 --> 01:01:59.994
- [Student] So is W-two
doing the weighting,

01:01:59.994 --> 01:02:01.705
or is h doing the weighting?

01:02:01.705 --> 01:02:03.506
- W-two is doing the
weighting, so the question is,

01:02:03.506 --> 01:02:06.397
"Is W-two doing the weighting
or is h doing the weighting?"

01:02:06.397 --> 01:02:09.710
h is the value, like in this example,

01:02:09.710 --> 01:02:14.144
h is the value of scores
for each of your templates

01:02:14.144 --> 01:02:17.277
that you have in W-one, right?

01:02:17.277 --> 01:02:21.128
So h is like the score function, right,

01:02:21.128 --> 01:02:25.851
it's how much of each
template in W-one is present,

01:02:25.851 --> 01:02:29.627
and then W-two is going
to weight all of these,

01:02:29.627 --> 01:02:31.341
weight all of these intermediate scores

01:02:31.341 --> 01:02:33.974
to get your final score for the class.

01:02:33.974 --> 01:02:36.194
- [Student] And which
is the non-linear thing?

01:02:36.194 --> 01:02:38.004
- So the question is, "which
is the non-linear thing?"

01:02:38.004 --> 01:02:42.171
So the non-linearity usually
happens right before h,

01:02:43.643 --> 01:02:47.896
so h is the value right
after the non-linearity.

01:02:47.896 --> 01:02:49.490
So we're talking about
this, like, you know,

01:02:49.490 --> 01:02:51.566
intuitively as this example of like,

01:02:51.566 --> 01:02:53.928
W-one is looking for, you know,

01:02:53.928 --> 01:02:55.363
has these same templates as before,

01:02:55.363 --> 01:02:57.152
and W-two is a weighting for these.

01:02:57.152 --> 01:02:59.314
In practice it's not
exactly like this, right,

01:02:59.314 --> 01:03:01.255
because as you said, there's all

01:03:01.255 --> 01:03:03.358
these non-linearities thrown in and so on,

01:03:03.358 --> 01:03:07.525
but it has this approximate
type of interpretation to it.

01:03:08.435 --> 01:03:11.602
- [Student] So h is just W-one-x then?

01:03:13.811 --> 01:03:16.715
- Yeah, yeah, so the
question is, is h just W-one times x?

01:03:16.715 --> 01:03:20.882
So h is just W-one times x,
with the max function on top.

01:03:27.004 --> 01:03:29.087
Oh, let me just, okay so,

01:03:30.259 --> 01:03:31.656
so we've talked about
this as an example of

01:03:31.656 --> 01:03:34.390
a two-layer neural network,
and we can stack more layers

01:03:34.390 --> 01:03:37.221
of these to get deeper networks
of arbitrary depth, right?

01:03:37.221 --> 01:03:39.202
So we can just do this one more time

01:03:39.202 --> 01:03:43.369
add another non-linearity and
matrix multiply now by W-three,

01:03:44.943 --> 01:03:47.773
and now we have a three-layer
neural network, right?

01:03:47.773 --> 01:03:49.921
And so this is where the
term deep neural networks

01:03:49.921 --> 01:03:51.604
is basically coming from, right?

01:03:51.604 --> 01:03:55.081
This idea that you can stack
multiple of these layers,

01:03:55.081 --> 01:03:57.831
you know, for very deep networks.

01:04:00.153 --> 01:04:03.486
And so in the homework you'll get practice

01:04:04.453 --> 01:04:07.623
of writing and, you know, training one of

01:04:07.623 --> 01:04:10.363
these neural networks, I
think in assignment two,

01:04:10.363 --> 01:04:13.534
but basically a full
implementation of this using

01:04:13.534 --> 01:04:15.533
this idea of forward pass, right,

01:04:15.533 --> 01:04:18.078
and backward passes, and using chain rule

01:04:18.078 --> 01:04:20.497
to compute gradients
that we've already seen.

01:04:20.497 --> 01:04:22.924
The entire implementation of
a two-layer neural network

01:04:22.924 --> 01:04:27.444
is actually really simple, it
can just be done in 20 lines,

01:04:27.444 --> 01:04:31.131
and so you'll get some practice
with this in assignment two,

01:04:31.131 --> 01:04:33.761
writing out all of these parts.
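
As a rough sketch of the kind of 20-line implementation being referred to, assuming random data, a sigmoid non-linearity, and a simple squared-error loss (the actual assignment uses different details):

import numpy as np

N, D_in, H, D_out = 64, 1000, 100, 10       # batch size and layer sizes (assumed)
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)
for t in range(500):
    h = 1.0 / (1.0 + np.exp(-x.dot(w1)))    # forward pass: hidden layer with sigmoid
    y_pred = h.dot(w2)                      # forward pass: outputs
    loss = np.square(y_pred - y).sum()      # squared-error loss
    grad_y_pred = 2.0 * (y_pred - y)        # backward pass: chain rule, layer by layer
    grad_w2 = h.T.dot(grad_y_pred)
    grad_h = grad_y_pred.dot(w2.T)
    grad_w1 = x.T.dot(grad_h * h * (1 - h)) # sigmoid local gradient is h * (1 - h)
    w1 -= 1e-4 * grad_w1                    # gradient descent step on the weights
    w2 -= 1e-4 * grad_w2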

01:04:33.761 --> 01:04:36.269
And okay, so now that we've sort of seen

01:04:36.269 --> 01:04:38.182
what neural networks are
as a function, right,

01:04:38.182 --> 01:04:41.005
like, you know, we hear
people talking a lot about

01:04:41.005 --> 01:04:44.613
how there's biological
inspirations for neural networks,

01:04:44.613 --> 01:04:47.816
and so it's
important to emphasize

01:04:47.816 --> 01:04:50.926
that these analogies are really loose,

01:04:50.926 --> 01:04:53.875
it's really just very loose ties,

01:04:53.875 --> 01:04:56.303
but it's still interesting to understand

01:04:56.303 --> 01:04:59.149
where some of these connections
and inspirations come from.

01:04:59.149 --> 01:05:03.445
And so now I'm going to
talk briefly about that.

01:05:03.445 --> 01:05:05.712
So if we think about a neuron, in kind of

01:05:05.712 --> 01:05:08.545
a very simple way, this neuron is,

01:05:09.396 --> 01:05:11.270
here's a diagram of a neuron.

01:05:11.270 --> 01:05:14.037
We have the impulses
that are carried towards

01:05:14.037 --> 01:05:15.144
each neuron, right, so we have a lot

01:05:15.144 --> 01:05:19.503
of neurons connected together
and each neuron has dendrites,

01:05:19.503 --> 01:05:22.008
right, and these are sort
of, these are what receive

01:05:22.008 --> 01:05:25.174
the impulses that come into the neuron.

01:05:25.174 --> 01:05:26.706
And then we have a cell body, right,

01:05:26.706 --> 01:05:30.362
that basically integrates
these signals coming in

01:05:30.362 --> 01:05:34.018
and then it takes this,

01:05:34.018 --> 01:05:36.927
and after integrating all
these signals, it passes on,

01:05:36.927 --> 01:05:38.786
you know, the impulse carries away from

01:05:38.786 --> 01:05:42.543
the cell body to downstream
neurons that it's connected to,

01:05:42.543 --> 01:05:46.376
right, and it carries
this away through axons.

01:05:48.337 --> 01:05:51.580
So now if we look at what
we've been doing so far, right,

01:05:51.580 --> 01:05:54.401
with each computational node,

01:05:54.401 --> 01:05:57.397
you can actually see it

01:05:57.397 --> 01:05:59.592
in kind of a similar way, right?

01:05:59.592 --> 01:06:01.694
Where nodes are connected to each other in

01:06:01.694 --> 01:06:05.688
the computational graph,
and we have inputs,

01:06:05.688 --> 01:06:09.983
or signals x, right,
coming into a neuron,

01:06:09.983 --> 01:06:13.732
and then all of these x,
right, x-zero, x-one, x-two,

01:06:13.732 --> 01:06:16.712
these are combined and
integrated together, right,

01:06:16.712 --> 01:06:19.812
using, for example our weights, W.

01:06:19.812 --> 01:06:22.821
So we do some sort of computation, right,

01:06:22.821 --> 01:06:25.109
and in some of the computations
we've been doing so far,

01:06:25.109 --> 01:06:29.381
something like W times x plus b, right,

01:06:29.381 --> 01:06:31.610
integrating all these together,

01:06:31.610 --> 01:06:33.432
and then we have an activation function

01:06:33.432 --> 01:06:36.931
that we apply on top, we get
this value of this output,

01:06:36.931 --> 01:06:40.764
and we pass it down to
the connecting neurons.
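
A crude sketch of this analogy in code, with all names assumed: the inputs are summed up with weights and a bias, then pushed through a sigmoid activation standing in for the firing rate:

import numpy as np

class Neuron:
    def __init__(self, weights, bias):
        self.weights, self.bias = weights, bias
    def forward(self, inputs):
        cell_body_sum = np.dot(self.weights, inputs) + self.bias   # integrate the incoming signals
        firing_rate = 1.0 / (1.0 + np.exp(-cell_body_sum))         # sigmoid activation on top
        return firing_rate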

01:06:41.733 --> 01:06:44.068
So if you look at this,

01:06:44.068 --> 01:06:45.898
you can think about this in
a very similar way, right?

01:06:45.898 --> 01:06:49.235
Like, you know, the signals coming in

01:06:49.235 --> 01:06:53.404
are kind of connected
at synapses, right?

01:06:53.404 --> 01:06:56.289
The synapse connecting
the multiple neurons,

01:06:56.289 --> 01:06:59.745
the dendrites are
integrating all of these,

01:06:59.745 --> 01:07:02.727
they're integrating all of
this information together in

01:07:02.727 --> 01:07:04.226
the cell body, and then we have

01:07:04.226 --> 01:07:08.489
the output carried on later on.

01:07:08.489 --> 01:07:10.261
And so this is kind of the analogy

01:07:10.261 --> 01:07:11.958
that you can draw between them,

01:07:11.958 --> 01:07:15.240
and if you look at these
activation functions, right?

01:07:15.240 --> 01:07:18.710
This is what basically takes
all the inputs coming in

01:07:18.710 --> 01:07:21.818
and outputs one number
that's going out later on,

01:07:21.818 --> 01:07:23.131
and we've talked about examples

01:07:23.131 --> 01:07:26.223
like sigmoid activation function, right,

01:07:26.223 --> 01:07:28.294
and different kinds of non-linearities,

01:07:28.294 --> 01:07:32.461
and so sort of one kind of
loose analogy that you can draw

01:07:34.601 --> 01:07:37.340
is that these
non-linearities can represent

01:07:37.340 --> 01:07:39.831
something sort of like the firing,

01:07:39.831 --> 01:07:42.368
or spiking rate of the neurons, right?

01:07:42.368 --> 01:07:46.771
Where our neurons transmit
signals to connecting neurons

01:07:46.771 --> 01:07:49.174
using kind of these
discrete spikes, right?

01:07:49.174 --> 01:07:51.215
And so we can think of, you know,

01:07:51.215 --> 01:07:53.909
if they're spiking very
fast then there's kind of

01:07:53.909 --> 01:07:56.664
a strong signal that's passed later on,

01:07:56.664 --> 01:07:58.266
and so we can think of this value

01:07:58.266 --> 01:08:02.035
after our activation function
as sort of, in a sense,

01:08:02.035 --> 01:08:05.335
sort of this firing rate
that we're going to pass on.

01:08:05.335 --> 01:08:08.084
And you know in practice,
I think neuroscientists

01:08:08.084 --> 01:08:09.916
who are actually studying this say

01:08:09.916 --> 01:08:11.614
that kind of one of the non-linearities

01:08:11.614 --> 01:08:14.493
that are most similar to the way

01:08:14.493 --> 01:08:16.247
that neurons are actually behaving

01:08:16.247 --> 01:08:19.560
is the ReLU non-linearity,

01:08:19.560 --> 01:08:20.535
which is something that we're going

01:08:20.536 --> 01:08:23.423
to look at more later
on, but it's a function

01:08:23.423 --> 01:08:28.160
that's at zero for all
negative values of input,

01:08:28.160 --> 01:08:31.716
and then it's a linear
function for everything

01:08:31.716 --> 01:08:34.072
that's in kind of a positive regime.

01:08:34.072 --> 01:08:35.362
And so, you know, we'll talk more about

01:08:35.362 --> 01:08:36.727
this activation function later on,

01:08:36.727 --> 01:08:39.176
but that's kind of, in practice,

01:08:39.176 --> 01:08:41.171
maybe the one that's most similar to

01:08:41.171 --> 01:08:44.004
how neurons are actually behaving.

01:08:46.020 --> 01:08:49.046
But it's really important
to be extremely careful

01:08:49.046 --> 01:08:52.056
with making any of these
sorts of brain analogies,

01:08:52.057 --> 01:08:53.978
because in practice biological neurons

01:08:53.978 --> 01:08:56.388
are way more complex than this.

01:08:56.388 --> 01:09:00.216
There's many different
kinds of biological neurons,

01:09:00.216 --> 01:09:01.352
the dendrites can perform

01:09:01.352 --> 01:09:04.951
really complex non-linear computations.

01:09:04.951 --> 01:09:07.884
Our synapses, right, the
W-zeros that we had earlier

01:09:07.884 --> 01:09:11.381
where we drew this analogy,
are not single weights

01:09:11.381 --> 01:09:14.033
like we had, they're actually
really complex, you know,

01:09:14.033 --> 01:09:17.087
non-linear dynamical systems in practice,

01:09:17.087 --> 01:09:21.602
and also this idea of interpreting
our activation function

01:09:21.602 --> 01:09:25.524
as a sort of rate code
or firing rate is also

01:09:25.524 --> 01:09:28.522
insufficient in practice, you know.

01:09:28.523 --> 01:09:32.251
It's just this kind of
firing rate is probably not

01:09:32.251 --> 01:09:34.671
a sufficient model of how neurons

01:09:34.671 --> 01:09:37.214
will actually communicate to
downstream neurons, right,

01:09:37.215 --> 01:09:41.366
like even in a very simple
sense,

01:09:41.366 --> 01:09:44.549
the neurons will fire at a variable rate,

01:09:44.549 --> 01:09:47.843
and this variability probably
should be taken into account.

01:09:47.843 --> 01:09:49.913
And so there's all of these, you know,

01:09:49.913 --> 01:09:53.357
it's kind of a much more complex thing

01:09:53.358 --> 01:09:56.226
than what we're dealing with.

01:09:56.226 --> 01:10:00.692
There are references, for example
on dendritic computation,

01:10:00.692 --> 01:10:03.944
that you can look at if you're
interested in this topic,

01:10:03.944 --> 01:10:06.374
but yeah, so in practice, you know,

01:10:06.374 --> 01:10:08.701
we can sort of see how
it may resemble a neuron

01:10:08.701 --> 01:10:11.883
at this very high level, but neurons are,

01:10:11.883 --> 01:10:15.916
in practice, much more
complicated than that.

01:10:15.916 --> 01:10:18.998
Okay, so we talked about how
there's many different kinds

01:10:18.998 --> 01:10:21.118
of activation functions
that could be used,

01:10:21.118 --> 01:10:24.065
there's the ReLU that I mentioned earlier,

01:10:24.065 --> 01:10:25.820
and we'll talk about all
of these different kinds

01:10:25.820 --> 01:10:29.828
of activation functions in
much more detail later on,

01:10:29.828 --> 01:10:31.868
choices of these activation functions

01:10:31.868 --> 01:10:34.474
that you might want to use.

01:10:34.474 --> 01:10:36.140
And so we'll also talk
about different kinds

01:10:36.140 --> 01:10:38.908
of neural network architectures.

01:10:38.908 --> 01:10:42.825
So we gave the example
of these fully connected

01:10:44.362 --> 01:10:46.243
neural networks, right,
where each layer is

01:10:46.243 --> 01:10:50.410
this matrix multiply, and
so the way we actually want

01:10:52.362 --> 01:10:55.969
to call these is, we said
two-layer neural network before,

01:10:55.969 --> 01:10:58.535
and that corresponded to
the fact that we have two

01:10:58.535 --> 01:11:01.356
of these linear layers,
right, where we're doing

01:11:01.356 --> 01:11:04.374
a matrix multiply, two
fully connected layers

01:11:04.374 --> 01:11:06.507
is what we call these.

01:11:06.507 --> 01:11:09.183
We could also call this a
one-hidden-layer neural network,

01:11:09.183 --> 01:11:10.384
so instead of counting the number

01:11:10.384 --> 01:11:13.186
of matrix multiplies we're doing,

01:11:13.186 --> 01:11:15.519
counting the number of
hidden layers that we have.

01:11:15.519 --> 01:11:17.654
I think it's, you can use either,

01:11:17.654 --> 01:11:19.011
I think maybe two-layer neural network

01:11:19.011 --> 01:11:21.918
is something that's a
little more commonly used.

01:11:21.918 --> 01:11:24.033
And then also here, for our
three-layer neural network

01:11:24.033 --> 01:11:27.319
that we have, this can also be called

01:11:27.319 --> 01:11:30.152
a two-hidden-layer neural network.

01:11:32.770 --> 01:11:36.759
And so we saw that, you
know, when we're doing

01:11:36.759 --> 01:11:38.425
this type of feed forward, right,

01:11:38.425 --> 01:11:41.330
forward pass through a neural network,

01:11:41.330 --> 01:11:44.247
each of these nodes in this network

01:11:46.768 --> 01:11:50.254
is basically doing the
kind of operation of

01:11:50.254 --> 01:11:53.587
the neuron that I showed earlier, right?

01:11:55.875 --> 01:11:57.983
And so what's actually happening is

01:11:57.983 --> 01:12:00.496
that basically each hidden
layer you can think of

01:12:00.496 --> 01:12:03.777
as a whole vector, right,
a set of these neurons,

01:12:03.777 --> 01:12:06.113
and so by writing it out this way

01:12:06.113 --> 01:12:10.828
with these matrix multiplies
to compute our neuron values,

01:12:10.828 --> 01:12:13.163
it's a way that we can
efficiently evaluate

01:12:13.163 --> 01:12:14.771
this entire layer of neurons, right?

01:12:14.771 --> 01:12:18.554
So with one matrix multiply
we get output values

01:12:18.554 --> 01:12:21.664
of, you know, of a layer of let's say 10,

01:12:21.664 --> 01:12:23.664
or 50 or 100 of neurons.

01:12:26.389 --> 01:12:31.082
All right, and so looking at
this again, writing this out,

01:12:31.082 --> 01:12:34.597
all out in matrix form,
matrix-vector form,

01:12:34.597 --> 01:12:38.179
we have our, you know, non-linearity here,

01:12:38.179 --> 01:12:40.743
f, that we're using, in this
case a sigmoid function, right,

01:12:40.743 --> 01:12:44.051
and we can take our data
x, some input vector

01:12:44.051 --> 01:12:48.226
or our values, and we can apply
our first matrix multiply,

01:12:48.226 --> 01:12:51.976
W-one on top of this,
then our non-linearity,

01:12:52.985 --> 01:12:54.436
then a second matrix multiply to get

01:12:54.436 --> 01:12:56.128
a second hidden layer, h-two,

01:12:56.128 --> 01:12:59.183
and then we have our final output, right?

01:12:59.183 --> 01:13:02.231
And so, you know, this
is basically all you need

01:13:02.231 --> 01:13:05.289
to be able to write a neural network,

01:13:05.289 --> 01:13:08.369
and as we saw earlier, the backward pass.

01:13:08.369 --> 01:13:11.199
You then just use backprop
to compute all of those,

01:13:11.199 --> 01:13:14.199
and so that's basically all there is

01:13:15.289 --> 01:13:19.456
to kind of the main idea
of what's a neural network.
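
A short NumPy sketch of that matrix-vector forward pass, with sizes assumed for illustration:

import numpy as np
f = lambda z: 1.0 / (1.0 + np.exp(-z))     # the non-linearity, here a sigmoid
x = np.random.randn(3, 1)                  # input vector (assumed 3-dimensional)
W1 = np.random.randn(4, 3)
W2 = np.random.randn(4, 4)
W3 = np.random.randn(1, 4)
h1 = f(W1.dot(x))                          # first hidden layer
h2 = f(W2.dot(h1))                         # second hidden layer
out = W3.dot(h2)                           # final output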

01:13:21.451 --> 01:13:24.115
Okay, so just to
summarize, we talked about

01:13:24.115 --> 01:13:28.454
how we could arrange neurons
into these computations, right,

01:13:28.454 --> 01:13:32.043
of fully-connected or linear layers.

01:13:32.043 --> 01:13:34.642
This abstraction of a
layer has a nice property

01:13:34.642 --> 01:13:36.726
that we can use very
efficient vectorized code

01:13:36.726 --> 01:13:38.604
to compute all of these.

01:13:38.604 --> 01:13:40.601
We also talked about how it's important

01:13:40.601 --> 01:13:42.628
to keep in mind that neural networks

01:13:42.628 --> 01:13:45.805
do have some, you know,
analogy and loose inspiration

01:13:45.805 --> 01:13:48.714
from biology, but they're
not really neural.

01:13:48.714 --> 01:13:50.692
I mean, this is a pretty loose analogy

01:13:50.692 --> 01:13:53.035
that we're making, and
next time we'll talk

01:13:53.035 --> 01:13:56.319
about convolutional neural networks.

01:13:56.319 --> 00:00:00.000
Okay, thanks.